OntoExtractor: A Fuzzy-Based Approach to Content and Structure-Based Metadata Extraction
Paolo Ceravolo, Ernesto Damiani, Marcello Leida, and Marco Viviani

Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, via Bramante, Crema (CR), Italy
{ceravolo, damiani, leida, viviani}@dti.unimi.it

In: R. Meersman, Z. Tari, P. Herrero et al. (Eds.): OTM Workshops 2006, LNCS 4278. © Springer-Verlag Berlin Heidelberg 2006

Abstract. This paper describes OntoExtractor, a tool for extracting metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The tool is specifically tailored for quick classification of semi-structured data, which makes it convenient for dealing with web-based data sources.

1 Introduction

Typically, knowledge management techniques use metadata in order to specify the content, quality, type, creation, and context of a data item. A number of specialized formats for the creation of metadata exist. A typical example is the Resource Description Framework (RDF), but metadata can be stored in any format, such as free text, Extensible Markup Language (XML), or database entries. All of these formats must rely on a vocabulary that can have different degrees of formality. If this vocabulary complies with a set of logical axioms, it is called an ontology.

There are a number of well-known advantages in using information extracted from data instead of the data themselves. On the one hand, because of their small size compared to the data they describe, metadata are more easily shareable than data. Thanks to metadata sharing, information about data becomes readily available to anyone seeking it. Thus, metadata make data discovery easier and reduce data duplication. On the other hand, some important drawbacks are restraining the diffusion of metadata formats. First of all, building a knowledge base is an onerous process. The domain analysis involves different activities that are often difficult to integrate, because they are usually performed by different professional roles.

In addition, the high cost of knowledge-base building conflicts with important knowledge management principles. Any knowledge management activity needs to be configured for a given domain. But every domain evolves, and the knowledge bases related to it have to evolve as well. If the domain is evolving rapidly, a misalignment may arise between the actual state of affairs of the domain and the knowledge base.

Moreover, classical knowledge extraction technologies are not tailored for web-based data. These techniques have been widely tested with successful results; however, they have some limitations. First, they need a large number of documents (typically, many thousands) to work properly. Second, they hardly take document structure into account and are therefore unsuitable for the semi-structured document formats used on the Web.

In this paper we present OntoExtractor, a tool supporting knowledge extraction activities in a web-based environment. OntoExtractor was designed to be inserted in a more general system aimed at managing the whole Ontology Life Cycle [5]. The classification produced as output is transformed into a standard metadata format and proposed to a community of users. Feedback from the community is collected in order to refine the classification, discarding metadata that express irrelevant classes or misclassified documents [4].

In order to support continuous domain evolution, OntoExtractor is designed to quickly produce a preliminary classification of a knowledge base. The tool supports heterogeneous sources of information, including semi-structured data. A fuzzy representation of document vectors makes it possible to segment documents according to their structural topology, assigning a different relevance value to each segment. Another important feature of OntoExtractor is that it produces different classifications, organizing the classes of documents according to different degrees of cohesion. This feature allows users to quickly discard a classification that is not coherent with their vision of the domain.
The paper is organized as follows: Section 2 introduces the tool, Section 3 describes the format adopted for document representation, Section 4 explains the techniques used in the structural classification of documents, Section 5 explains the techniques used in the content classification, and Section 6 presents the conclusions.

2 OntoExtractor

OntoExtractor is a tool, developed in the context of the KIWI project¹, which extracts metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The construction of the hierarchy occurs in a bottom-up fashion: starting from the heterogeneous document set, a clustering process groups documents into meaningful clusters. These clusters identify the backbone hierarchy of the ontology. Construction of the hierarchy is a three-step process, composed of the following phases:

1. Normalize the incoming documents into XML format [9].
2. Cluster the documents according to their structure using a Fuzzy Bag representation of the XML tree [3] [6].
3. Refine the structural clustering by analyzing the content of the documents, producing a semantic clustering of the documents.

¹ This work was partly funded by the Italian Ministry of Research Fund for Basic Research (FIRB) under projects RBAU01CLNB 001 Knowledge Management for the Web Infrastructure (KIWI).

Fig. 1. Overview of the OntoExtractor process

3 Normalize the Knowledge Base

The first step in our process is choosing a common representation format for the information to be managed. Data may come from different and heterogeneous sources, including unstructured, semi-structured or structured information such as textual documents, HTML files, XML files or records in a database. In order to reconcile these different data sources we developed a set of wrapper applications that transform the most widely used document formats into a target XML representation.

The wrapping process is shown in Figure 2. For semi-structured and structured sources the wrapper does not have much to do: all it has to perform is a mapping between the original data and elements in the target XML tree. Unstructured sources of information need additional processing aimed at extracting the hidden structure of the documents. This phase uses well-known text-segmentation techniques [9] in order to find relations among parts of a text. It is an iterative process that takes as input a text blob (a continuous flow of characters representing the whole content of a document) and gives as output the set of text segments identified by the text segmentation process. The process stops when no text blob can be segmented further. At this point, a post-processing phase analyzes the resulting tree structure and generates the corresponding XML document.

In the current version of the OntoExtractor software, a regular-expression matching approach is also available in order to discover regular patterns, such as the titles of sections in the documents, which helps control the text segmentation process. This is a preliminary approach that compares each row of the document with a regular expression (e.g. [0-9]+(([.]?)|([.]?[0-9]+))*(\s+\w+)+ ; we used this expression to match chapter, section and paragraph headlines, which are usually preceded by numbers separated by a '.').
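The row-by-row matching described above can be sketched in a few lines of Python. This is only an illustration: the pattern is the heading expression with its hyphens, alternation bar and star restored as an assumption (those characters did not survive in the transcription), and the helper name find_headings is hypothetical rather than part of OntoExtractor.

```python
import re

# Heading pattern from the text; the '-', '|' and '*' characters are
# assumed reconstructions of symbols lost in the transcription.
HEADING = re.compile(r"[0-9]+(([.]?)|([.]?[0-9]+))*(\s+\w+)+")


def find_headings(text):
    """Return (row_number, row) pairs for rows that fully match the pattern."""
    hits = []
    for number, row in enumerate(text.splitlines(), start=1):
        stripped = row.strip()
        # fullmatch: the whole row must look like "<numbering> <words>"
        if HEADING.fullmatch(stripped):
            hits.append((number, stripped))
    return hits
```

Note that a pattern this permissive also accepts ordinary rows that happen to start with a number and contain only words (for example "12 documents were processed"), which is presumably one reason the text calls the approach preliminary.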
Fig. 2. Wrapping process

4 Clustering by Structure

The OntoExtractor tool uses a flat encoding for the internal representation of XML documents for processing and analysis purposes. Documents are represented as Fuzzy Bags, i.e. collections of elements that may contain duplicates. Because the importance of tags can differ, a different weight (in the range from 0 to 1) can be assigned to each tag in the document. In other words, for each element in the XML document d, the Fuzzy Bag encoding of d contains a Fuzzy Element whose membership value is determined by the position of the tag in the document's structure or by other topological properties. OntoExtractor currently provides two different algorithms to calculate the membership function of a Fuzzy Element:

1. Nesting: this is a lossy representation of the original document's topology, because the membership value does not keep track of which tag is the parent of the current tag, as shown in Figure 3. Given a vocabulary V = {R/1, a/0.9, b/0.8, d/0.6, e/0.4}, applying the nesting weighting function to a generic XML document, such as A.xml or B.xml, we obtain the fuzzy bag A = B = {R/1, a/0.3, a/0.225, b/0.2, d/0.3, e/0.2}. The membership value for each element is $M = V_e / L$, where $M$ is the membership value, $V_e$ is the weight of the tag in the vocabulary, and $L$ is the nesting level of the tag, with $L_{root} = 0$.

2. MV: this is an experimental algorithm introduced by our group, which keeps memory of the parent tag. The membership value for each element is $M = (V_e + M_p) / L$, where $M$ is the membership value, $M_p$ is the membership value of the parent tag, with $M_{root} = 0$, $V_e$ is the weight of the tag in the vocabulary, and $L$ is the nesting level of the tag, with $L_{root} = 0$.

Fig. 3. Two generic XML documents A.XML and B.XML

The MV membership value helps, in certain cases, to keep memory of the tree structure of the original document. Referring to Figure 3 and using the same vocabulary V, applying the MV weighting function to the tree representation of the two XML documents A.xml and B.xml we obtain A = {R/1, a/0.53, a/0.36, b/0.33, d/0.8, e/0.7} and B = {R/1, a/0.56, a/0.37, b/0.34, d/0.8, e/0.7}, which are different. Figure 4 shows the differences in processing an XML document coming from Amazon with the Nesting and MV algorithms respectively.

In order to compare the XML documents modeled as fuzzy bags, we rely on the well-known similarity measures studied in [1] [2]. We privileged measures that give a higher similarity weight to bags in which the elements (tags) belonging to the intersection are less nested. This is motivated by the fact that, if a tag is near the root, it seems reasonable to assume that it has a higher semantic value. In OntoExtractor the comparison between two Fuzzy Bags is computed using the Jaccard norm:

$$S(B_1, B_2) = \mathrm{Approx}\left( \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|} \right)$$

where $B_1$ and $B_2$ are the input fuzzy bags; $\cap$ is the intersection operator; $\cup$ is the union operator; $|\cdot|$ is the cardinality operator; $\mathrm{Approx}()$ is the approximation operator; and $S$ is the similarity value between $B_1$ and $B_2$.
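The two weighting schemes and the comparison step can be illustrated with a minimal Python sketch. Several things here are assumptions rather than facts from the paper: the tree shapes of A.xml and B.xml are inferred from the worked membership values (Figure 3 is not reproduced), the nesting level is counted from 1 at the root because that reproduces the printed values and avoids dividing by zero, and jaccard is a simplified stand-in that pairs duplicate tags by descending membership and omits Approx(). The operators actually used by the tool are defined in [3] and [6].

```python
from collections import defaultdict

VOCAB = {"R": 1.0, "a": 0.9, "b": 0.8, "d": 0.6, "e": 0.4}

# Trees as (tag, [children]); shapes inferred from the worked values (assumption).
A = ("R", [("d", []), ("e", [("a", [("a", []), ("b", [])])])])
B = ("R", [("d", [("a", [("a", []), ("b", [])])]), ("e", [])])


def fuzzy_bag(tree, weighting, depth=1, parent_m=0.0):
    """Flatten a tree (pre-order) into a list of (tag, membership) pairs."""
    tag, children = tree
    if weighting == "nesting":
        m = VOCAB[tag] / depth               # M = V_e / L
    elif weighting == "mv":
        m = (VOCAB[tag] + parent_m) / depth  # M = (V_e + M_p) / L
    else:
        raise ValueError(f"unknown weighting: {weighting}")
    bag = [(tag, m)]
    for child in children:
        bag.extend(fuzzy_bag(child, weighting, depth + 1, m))
    return bag


def _by_tag(bag):
    """Group memberships by tag, largest first, so duplicates can be paired."""
    groups = defaultdict(list)
    for tag, m in bag:
        groups[tag].append(m)
    return {tag: sorted(ms, reverse=True) for tag, ms in groups.items()}


def jaccard(bag1, bag2):
    """Simplified fuzzy Jaccard: sum of pairwise minima over sum of pairwise maxima."""
    g1, g2 = _by_tag(bag1), _by_tag(bag2)
    inter = union = 0.0
    for tag in set(g1) | set(g2):
        ms1, ms2 = g1.get(tag, []), g2.get(tag, [])
        n = max(len(ms1), len(ms2))
        ms1 = ms1 + [0.0] * (n - len(ms1))   # pad the shorter list with zeros
        ms2 = ms2 + [0.0] * (n - len(ms2))
        inter += sum(min(x, y) for x, y in zip(ms1, ms2))
        union += sum(max(x, y) for x, y in zip(ms1, ms2))
    return inter / union if union else 1.0
```

Under these assumptions the nesting bags of A and B coincide, so their similarity is 1, while the MV bags differ slightly and their similarity drops just below 1, which is the distinction the MV scheme is meant to preserve. The MV values agree with the printed ones to within rounding.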
Fig. 4. Fuzzy Bags generated by the Nesting and MV algorithms, and the XML representation of the document.

For more theoretical information about this norm and how the union, intersection, approximation and cardinality operations are expressed, please refer to [3] and [6]. Using this norm the tool performs a partitional clustering technique that is a hybrid between the K-means and K-NN clustering algorithms. OntoExtractor uses an alpha-cut value as a threshold for the clustering process, so that the initial number of clusters (k) does not have to be specified, which avoids some clustering problems related to the k-means algorithm. The clustering algorithm compares each document with the centroid of each cluster, considering only the highest resemblance value. If this value is greater than the given alpha, the document is inserted in that cluster; otherwise a new empty cluster is generated and the document is inserted in it.

OntoExtractor offers two different ways to calculate the centroid of each cluster. The first method chooses the document that has the smallest representative Fuzzy Bag; in this case the centroid always corresponds to a real document. The second method generates a new Fuzzy Bag as the union of all the Fuzzy Bags in the cluster; in this case the generated Fuzzy Bag does not necessarily correspond to a real document.

5 Clustering by Content

The second clustering process that we propose is based on the content connected to leaf nodes. Content-based clustering is independent for each structural cluster selected, so it is possible to give different clustering criteria for each structural cluster generated, as shown in Figure 5. Note that users can select which clustering process to perform; for instance, if there is no need for structural clustering, then only content-based clustering is performed.

Fig. 5. Domain class subdivision based on structure (a) and refinement based on content (b)

It is important to remember that our clustering technique works on XML documents that are somehow structured. Therefore we compute content-based similarity at tag level, comparing the content of the same tag in different documents. Then we compute content-based similarity at document level by aggregating the tag-level similarity values.

Fig. 6. Tag-level comparison between data belonging to the same tag in different documents

Referring to Figure 6, it is necessary to choose two different functions: a function f to compare data belonging to tags with the same name in different documents:

$$f_a(a[data]_A;\, a[data]_B); \quad f_b(b[data]_A;\, b[data]_B); \quad f_c(c[data]_A;\, c[data]_B) \tag{1}$$

and a function F to aggregate the individual f values: $F(f_a, f_b, f_c)$. We have two possibilities for choosing the F function:

- F is a t-norm: conjunction of the single values ($f_a \wedge f_b \wedge f_c$);
- F is a t-conorm: disjunction of the single values ($f_a \vee f_b \vee f_c$).

Fig. 7. A: comparison in case of null values. B: comparison in case of nested values.

Referring to Figure 7, it is evident that we also need to consider cases where a tag is not present in a document, and cases of documents having multiple instances of the same tag at different nesting levels. In the first case we have:

$$f_b(\mathrm{null};\, b[data]_B) = 0 \tag{2}$$

and in the second case we evaluate the distance between the tags using the formula:

$$f_x = \max_{p,k} \left[ \frac{1}{1+\Delta}\, f_{x_{p,k}}\big(x_p[data]_A;\, x_k[data]_B\big) \right] \tag{3}$$

$$\Delta = |\mu(x_p) - \mu(x_k)| \tag{4}$$

Occurrences of terms have distinct informative roles depending on the tags they belong to. So, it is possible either to define a different function f for each group of data of the same tag in different documents, or to choose a function that takes into account the membership value $\mu(x_i)$ associated with the i-th tag.

We represent the content of each tag (a[data], b[data], c[data], ... in (1)) with the well-known Vector Space Model, widely used in modern information retrieval systems. The vector space model (VSM) is an algebraic model used for information filtering and information retrieval. It represents natural language documents in a formal manner by the use of vectors in a multi-dimensional space. The vector space model usually builds a documents-terms matrix and processes it to generate the document-terms vectors. Our approach is similar, but we generate one matrix for each tag in the document; correspondingly, we generate a tag-terms vector. There are several methods to generate the tag-terms vector, such as LSA (Latent Semantic Analysis [7]) or SVD (Singular Value Decomposition), a well-known method of matrix reduction that adds latent semantic meaning to the vectors.
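As a rough illustration of the tag-level comparison and its aggregation, the following Python sketch covers the null case (2), the nested case (3)-(4), and the choice of F. It rests on several assumptions not stated in the paper: f is taken to be cosine similarity over raw term counts, Δ is read as an absolute difference of membership values, min and max stand in for the t-norm and t-conorm, and the document layout (a dict mapping each tag name to a list of (membership, text) instances) is hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between raw term-count vectors (a stand-in for f)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def tag_similarity(instances_a, instances_b, f=cosine):
    """Compare one tag across two documents.

    Each argument is a list of (membership, text) pairs, one per occurrence
    of the tag. A missing tag scores 0, as in (2); otherwise the best pair
    wins, damped by the membership gap, as in (3) and (4).
    """
    if not instances_a or not instances_b:
        return 0.0
    return max(f(text_a, text_b) / (1.0 + abs(mu_a - mu_b))
               for mu_a, text_a in instances_a
               for mu_b, text_b in instances_b)


def document_similarity(doc_a, doc_b, aggregate=min):
    """Aggregate tag-level scores with F: min as a t-norm, max as a t-conorm."""
    tags = set(doc_a) | set(doc_b)
    scores = [tag_similarity(doc_a.get(t, []), doc_b.get(t, [])) for t in tags]
    return aggregate(scores) if scores else 0.0
```

With min as F, a single tag that is missing or unrelated pulls the document-level similarity to 0, whereas max reports the best-matching tag; which behaviour is preferable depends on the clustering criteria chosen for each structural cluster.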
In OntoExtractor, generating the tag-terms vectors is a three-step process:
Generating the tags-terms matrix: for each tag in the document, a documents-terms matrix is produced. It is important to remember that we do not consider the document as a single text blob; instead, we build the documents-terms matrix at the tag level. If a tag is not present in a document, a row of zeros is added to the matrix. Each entry in the matrix can be computed in several ways, by choosing one of the weighting methods implemented in the tool. At present it is possible to choose among tf-idf, tf-df, tf and term occurrence.

Transforming the matrix: once the matrix has been generated, we process it with some matrix transformations. The user can choose between keeping the original matrix and transforming it via LSA using SVD. This method relies on the fact that any $m \times n$ matrix $A$ (with $m \geq n$) can be written as the product of an $m \times n$ column-orthogonal matrix $U$, an $n \times n$ diagonal matrix $\Sigma$ with positive or zero elements, and the transpose of an $n \times n$ orthogonal matrix $V$. More generally, suppose $M$ is an $m \times n$ matrix whose entries come from the field $K$, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form $M = U \Sigma V^{*}$, where $U$ is an $m \times m$ unitary matrix over $K$, the matrix $\Sigma$ is $m \times n$ with non-negative numbers on the diagonal and zeros off the diagonal, and $V^{*}$ denotes the conjugate transpose of $V$, an $n \times n$ unitary matrix over $K$. Such a factorization is called a singular-value decomposition of $M$. The matrix $V$ thus contains a set of orthogonal input or analysing base-vector directions for $M$. The matrix $U$ contains a set of orthogonal output base-vector directions for $M$. The matrix $\Sigma$ contains the singular values, which can be thought of as scalar gain controls by which each corresponding input is multiplied to give a corresponding output. After the matrix decomposition we generate a new $m \times n$ matrix using an r-reduction of the original SVD decomposition: $M' = U_r \Sigma_r V_r^{*}$.
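A small worked example may make the r-reduction concrete. The 2 × 2 matrix below is purely illustrative (it is not taken from the paper); think of its rows as two documents and its columns as two terms for a single tag.

```latex
M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
  = U \Sigma V^{*}, \qquad
U = V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}

% Keeping only the largest singular value (r = 1):
M' = U_1 \Sigma_1 V_1^{*}
   = 3 \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}
       \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \end{pmatrix}
   = \begin{pmatrix} 1.5 & 1.5 \\ 1.5 & 1.5 \end{pmatrix}
```

In this toy case the reduction keeps the direction the two rows share and drops the component in which they differ, so the two rows of $M'$ become identical.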
Only the $r$ column vectors of $U$ and the $r$ row vectors of $V^{*}$ corresponding to the non-zero singular values $\Sigma_r$ are calculated. The resulting new matrix is no longer sparse: it is densely populated by values that carry hidden semantic meaning.

Storing the vectors: each row in the matrix is stored in the associated tag in the document model as a new Fuzzy Bag, with the terms as the elements and the entries in the vector as membership values. Tag contents are now represented by Fuzzy Bags, and we can compare them by means of different distance measures; for example, we can use traditional vector-space measures such as the Cosine distance.

6 Conclusions and Further Work

To address the synonymy and polysemy problems, the next versions of OntoExtractor will add new processors that use external ontologies to identify concepts. However, this approach introduces other problems that have to be considered. One of these is Word Sense Disambiguation (WSD). The validity of this tool must be evaluated within the complete system in which it is embedded. Further work will provide a report on evaluations of the KIWI system.

References

1. B. Bouchon-Meunier, M. Rifqi, S. Bothorel: Towards general measures of comparison of objects. Fuzzy Sets and Systems, volume 84.
2. P. Bosc, E. Damiani: Fuzzy Service Selection in a Distributed Object-Oriented Environment. IEEE Transactions on Fuzzy Systems, volume 9, no. 5.
3. P. Ceravolo, M.C. Nocerino, M. Viviani: Knowledge extraction from semi-structured data based on fuzzy techniques. Knowledge-Based Intelligent Information and Engineering Systems, Proceedings of the 8th International Conference, KES 2004, Part III.
4. P. Ceravolo, E. Damiani, M. Viviani: Adding a Peer-to-Peer Trust Layer to Metadata Generators. Lecture Notes in Computer Science, Volume 3762.
5. P. Ceravolo, A. Corallo, E. Damiani, G. Elia, M. Viviani, A. Zilli: Bottom-up extraction and maintenance of ontology-based metadata. Fuzzy Logic and the Semantic Web, Computer Intelligence, Elsevier.
6. E. Damiani, M.C. Nocerino, M. Viviani: Knowledge extraction from an XML data flow: building a taxonomy based on clustering technique. Current Issues in Data and Knowledge Engineering, Proceedings of EUROFUSE 2004: 8th Meeting of the EURO Working Group on Fuzzy Sets.
7. T. K. Landauer, P. W. Foltz, D. Laham: Introduction to Latent Semantic Analysis. Discourse Processes, 25.
8. G. Salton, C. Buckley: Term Weighting Approaches in Automatic Text Retrieval. Technical Report, Cornell University, 1987.
9. G. Salton, A. Singhal, C. Buckley, M. Mitra: Automatic Text Decomposition Using Text Segments and Text Themes. Conference on Hypertext, pages 53-65, 1996.
Mining Class Hierarchies from XML Data: Representation Techniques
Mining Class Hierarchies from XML Data: Representation Techniques Paolo Ceravolo 1 and Ernesto Damiani 1 Department of Information Technology University of Milan Via Bramante, 65-26013 Crema (Italy) damiani,
More informationConception of Ontology for Security in Health Care Systems
Conception of Ontology for Security in Health Care Systems Dr. J. Indumathi Department of Information Science and Technology, Anna University, Chennai, Tamilnadu, India Abstract- The insidious and omnipresent
More informationWhich Role for an Ontology of Uncertainty?
Which Role for an Ontology of Uncertainty? Paolo Ceravolo, Ernesto Damiani, Marcello Leida Dipartimento di Tecnologie dell Informazione - Università degli studi di Milano via Bramante, 65-26013 Crema (CR),
More informationThe HMatch 2.0 Suite for Ontology Matchmaking
The HMatch 2.0 Suite for Ontology Matchmaking S. Castano, A. Ferrara, D. Lorusso, and S. Montanelli Università degli Studi di Milano DICo - Via Comelico, 39, 20135 Milano - Italy {castano,ferrara,lorusso,montanelli}@dico.unimi.it
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System
ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,
More informationDocument Clustering: Comparison of Similarity Measures
Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationLATENT SEMANTIC ANALYSIS AND WEIGHTED TREE SIMILARITY FOR SEMANTIC SEARCH IN DIGITAL LIBRARY
6-02 Latent Semantic Analysis And Weigted Tree Similarity For Semantic Search In Digital Library LATENT SEMANTIC ANALYSIS AND WEIGHTED TREE SIMILARITY FOR SEMANTIC SEARCH IN DIGITAL LIBRARY Umi Sa adah
More informationVector Space Models: Theory and Applications
Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du
More informationSemantic text features from small world graphs
Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK
More informationMethods for Intelligent Systems
Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering
More informationVK Multimedia Information Systems
VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Information Retrieval Basics: Agenda Vector
More informationLRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier
LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
More informationChapter 3 Text clustering as a mining task
Chapter 3 Text clustering as a mining task F. Mandreoli, R. Martoglia & P. Tiberio Dipartimento di Ingegneria dell Informazione, Università di Modena e Reggio Emilia, Modena, Italy. Abstract In this chapter
More informationA hybrid method to categorize HTML documents
Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper
More informationLearning Probabilistic Ontologies with Distributed Parameter Learning
Learning Probabilistic Ontologies with Distributed Parameter Learning Giuseppe Cota 1, Riccardo Zese 1, Elena Bellodi 1, Fabrizio Riguzzi 2, and Evelina Lamma 1 1 Dipartimento di Ingegneria University
More informationUsing Semantic Similarity in Crawling-based Web Application Testing. (National Taiwan Univ.)
Using Semantic Similarity in Crawling-based Web Application Testing Jun-Wei Lin Farn Wang Paul Chu (UC-Irvine) (National Taiwan Univ.) (QNAP, Inc) Crawling-based Web App Testing the web app under test
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationOptimal Decision Trees Generation from OR-Decision Tables
Optimal Decision Trees Generation from OR-Decision Tables Costantino Grana, Manuela Montangero, Daniele Borghesani, and Rita Cucchiara Dipartimento di Ingegneria dell Informazione Università degli Studi
More informationMinoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
More informationThe Information Retrieval Series. Series Editor W. Bruce Croft
The Information Retrieval Series Series Editor W. Bruce Croft Sándor Dominich The Modern Algebra of Information Retrieval 123 Sándor Dominich Computer Science Department University of Pannonia Egyetem
More informationOntology Extraction from Heterogeneous Documents
Vol.3, Issue.2, March-April. 2013 pp-985-989 ISSN: 2249-6645 Ontology Extraction from Heterogeneous Documents Kirankumar Kataraki, 1 Sumana M 2 1 IV sem M.Tech/ Department of Information Science & Engg
More informationA Graph Theoretic Approach to Image Database Retrieval
A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500
More informationClustering. Bruno Martins. 1 st Semester 2012/2013
Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts
More informationProfile Based Information Retrieval
Profile Based Information Retrieval Athar Shaikh, Pravin Bhjantri, Shankar Pendse,V.K.Parvati Department of Information Science and Engineering, S.D.M.College of Engineering & Technology, Dharwad Abstract-This
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationReading group on Ontologies and NLP:
Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.
More informationSemantic Web Search Model for Information Retrieval of the Semantic Data *
Semantic Web Search Model for Information Retrieval of the Semantic Data * Okkyung Choi 1, SeokHyun Yoon 1, Myeongeun Oh 1, and Sangyong Han 2 Department of Computer Science & Engineering Chungang University
More informationCHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING
43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationRepresentation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s