Domain-specific Concept-based Information Retrieval System

L. Shen 1, Y. K. Lim 1, H. T. Loh 2
1 Design Technology Institute Ltd, National University of Singapore, Singapore
2 Department of Mechanical Engineering, National University of Singapore, Singapore

Abstract
A domain-specific concept-based information retrieval system is introduced to ease the retrieval of internal organizational documents, e.g., technical reports, professional references and even e-mails. The system is composed of separate modules so that it can be customized easily. Domain knowledge is organized as a taxonomy for ease of knowledge storage and sharing. Finally, a comparison between concept-based and keyword-based retrieval shows the advantages of the system.

Keywords
Concept-based, domain-specific, information retrieval

I. INTRODUCTION AND MOTIVATION
We live in an information-intensive society. The problem faced by organizations today is not a lack of information but rather too much of it. We are caught in the midst of an information explosion. Every day, organizations generate huge amounts of data such as e-mails, reports, journals, news, memos, etc., to support their business operations. A large portion of these data contains valuable information which is useful for making informed business decisions. However, with the vast amount of information generated at such an alarming rate, manual processing has become increasingly difficult, making it not only inefficient but also ineffective in capturing and organizing the information. As a result, most of the data are not utilized to gain insights that could give the company a competitive advantage.

In this paper, we introduce a concept-based information retrieval system. It incorporates concepts modeled after the application domain to enable more effective and accurate searching. Most conventional search mechanisms are primarily based on keyword search, and this poses several limitations.
One difficulty springs from the fact that a single word may have different meanings depending on its usage. For example, the word "lead" has different meanings in such phrases as "lead astray", "take the lead" and "heavy as lead". Conversely, different words or phrases may refer to the same thing. For example, the phrases "soda", "pop", "soft drink", "cola" and "carbonated beverage" can all refer to the same thing. To overcome this ambiguity, we implement a concept-based framework using text mining techniques to allow searching based on concepts.

In a nutshell, text mining is the art and science of extracting information and knowledge from text. Currently, most text mining systems employ some form of statistical method without using any context information from the application domain. In fact, the role of domain knowledge has been given little attention. Hence, we are developing a domain-specific, context-based information retrieval system to provide an intelligent solution for precise information delivery. Using this information retrieval system is very similar to performing a keyword search on a search engine. However, domain knowledge is used to resolve the ambiguity that results from keyword comparison, thereby enhancing the quality of the final search result.

The rest of this paper is organized as follows. Section 2 gives an overview of the system structure. Section 3 describes knowledge representation in the form of an ontology. The text analysis module, which enhances retrieval accuracy, is introduced in Section 4. Section 5 presents some screenshots of our system and compares the performance of keyword-based search and concept-based search. Finally, we present some conclusions and future work.

II. OVERVIEW OF SYSTEM
Fig. 1 illustrates the information retrieval system. The domain knowledge here is Data Mining. The separate modules are introduced as follows.

1) Text: The documents are collected from various sources. Some are technical reports from different departments.
Some are professional references used for research and applications, most of which are in PDF format. Useful web pages and corresponding e-mails are also used as inputs.

2) Preprocessing: At this step, the original documents are first converted into plain text. The NLP (Natural Language Processing) software GATE [1] was applied to parse the documents, including tokenization and part-of-speech tagging. Stemming and stop-word removal were also applied to remove noise from the text. Finally, the distinguishing keywords/phrases are extracted to represent the documents.

3) Keywords/phrase extraction: At this step, some heuristics are applied, including the frequency of each keyword/phrase, the different combinations of parts of speech, and so on. The extracted keywords/phrases are used to represent the documents. Inverted document indexing was used to generate the full-text index, which is used for keyword-based retrieval.

4) Feature selection: Since the system is used for concept-based retrieval, it cannot use all the words in the full-text index. Only the representative words within the concept hierarchy are important. Which words are differentiating and which are non-differentiating is decided by a domain expert. The details of domain knowledge representation are introduced in Section 3.

5) Text analysis: Different text mining techniques are used to map documents into different concepts. Currently, document classification gives good performance in our system.

6) Thematic indexing: The documents are indexed according to the concept hierarchy.

7) Query: Our system can handle two kinds of queries: keyword-based search and concept-based search. A keyword search is enclosed in double quotes (" ") to distinguish it from a concept search, and it is processed using the full-text index. When a user issues a concept-based query, it is first mapped to a concept node within the taxonomy. The documents associated with this concept node are then returned.

[Fig. 1. Structure of concept-based information retrieval system. Blocks: Domain Knowledge, Text, Preprocessing, Key Words/Phrases Extraction, Feature Selection, Text Analysis, Full-Text Indexing, Thematic Indexing, Query.]

0-7803-8519-5/04/$20.00 © 2004 IEEE

III. DOMAIN KNOWLEDGE REPRESENTATION
One of the advantages of our framework over others is the ability to incorporate domain knowledge specific to the application environment. The main function of the domain knowledge is to resolve semantic ambiguities such as those generated by the presence of synonyms. In a nutshell, the concepts that exist in the application environment are organized into an ontology, which provides the means for encoding background knowledge of inter-concept relationships and connections to a shared vocabulary.
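
As a rough illustration of the query step (item 7 above) and of the synonym resolution just described, the following Python sketch dispatches a query either to a full-text inverted index or to a concept lookup. It is a minimal sketch under stated assumptions, not the system's implementation: the function names, the sample documents, the synonym table, the thematic index contents, and the choice to require every quoted token to match are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def run_query(query, full_text_index, concept_synonyms, thematic_index):
    """Double-quoted queries use the full-text index; others use concepts."""
    q = query.strip()
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        tokens = q[1:-1].lower().split()
        if not tokens:
            return set()
        # Keyword search (assumption: every quoted token must be present).
        result = set(full_text_index.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= full_text_index.get(token, set())
        return result
    # Concept search: resolve the query or one of its synonyms to a node.
    concept = concept_synonyms.get(q.lower())
    if concept is None:
        return set()
    return set(thematic_index.get(concept, set()))


# Hypothetical sample data, for illustration only.
DOCS = {
    1: "rough set theory for rule extraction",
    2: "SVM kernel parameter tuning",
    3: "RST reduces redundant attributes",
}
FULL_TEXT_INDEX = build_inverted_index(DOCS)
CONCEPT_SYNONYMS = {"rough set": "rough_sets", "rst": "rough_sets",
                    "rs": "rough_sets", "svm": "svm"}
THEMATIC_INDEX = {"rough_sets": {1, 3}, "svm": {2}}

keyword_hits = run_query('"rough set"', FULL_TEXT_INDEX,
                         CONCEPT_SYNONYMS, THEMATIC_INDEX)
concept_hits = run_query("RST", FULL_TEXT_INDEX,
                         CONCEPT_SYNONYMS, THEMATIC_INDEX)
print(sorted(keyword_hits), sorted(concept_hits))
```

In this toy data, the quoted query only reaches the document that literally contains both words, whereas the concept query also reaches the document that uses the synonym RST, which is the behaviour the synonym handling is meant to provide.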
An ontology is a formal, explicit description of concepts, known as classes, in a domain of discourse [2]. The properties of each class, which are called slots, describe its various features and attributes. The restrictions on these slots are represented by facets. An ontology together with a set of individual instances of classes constitutes a knowledge base. In practical terms, developing an ontology includes:
- Defining classes in the ontology,
- Arranging the classes in a taxonomic (subclass-superclass) hierarchy,
- Defining slots and describing allowed values for these slots,
- Filling in the values for slots for instances.

We can then create a knowledge base by defining individual instances of these classes and filling in specific slot values as well as additional slot restrictions.

We use the task of modeling knowledge about the Data Mining domain as an example. This model includes such slots as keywords, features, kernels, parameters, and algorithms, which describe the properties of an instance of the class DM_Technique. We do not attempt to build a comprehensive model of this domain; rather, we use it solely as an illustration of the concepts we discuss. Protégé [3], an open-source software package, was used to build the taxonomy.

Fig. 2 illustrates a simple Data Mining taxonomy. Data Mining is the root node. It includes two subclasses, i.e., Data Mining Techniques (DM Techniques) and Data Mining Attributes (DM Attributes). DM Attributes, which include algorithm, software and keywords, are associated with DM Techniques. Keywords describe the highest level of a given DM technique. They are differentiating keywords, which are used to distinguish this technique from others. Table I gives an example of the keywords of the DM technique rough sets. All the words are presented in their stemmed form. The words within square brackets are concept-synonyms, which means that they are related to each other at the concept level. The documents which contain these words will be retrieved as long as one of them appears in the user's query. The words within curly brackets are keyword-synonyms; they carry the same meaning. These two kinds of synonyms are used in our prototype. None of these keywords can be assigned to any sub-concept node. For example, for the DM technique rough sets, none of its keywords can be repeated under its sub-concepts, e.g., algorithm or software.

[Fig. 2. Data Mining Taxonomy. Node labels: Data Mining, DM technique, DM attribute, keyword, feature, kernel, parameter, algorithm, software. Relationship types: subclass, association (belong_to).]

TABLE I
KEYWORDS OF ROUGH SETS
Differentiating KWs: {rough set, RST, RS}, [indiscern, equival relat], [decis tabl, inform system], [accuraci of approxim, qualiti of approxim, lower approxim, upper approxim], core, [posit region, neg region]
Non-differentiating KWs: [condit attribut, decis attribute], redund attribut, [conflict, inconsist], [decis rule, {rule gener, rule extract}]

Each concept has its own set of keywords. These keywords are used to classify the documents into different concept nodes.

IV. TEXT ANALYSIS
Here the text analysis module is used to classify the documents into a specific concept node as depicted in Fig. 2. This can be seen as a hierarchical classification problem. At the leaf level, for example rough sets algorithm, the documents are classified into the concept node using its associated keywords. Each keyword is treated as one feature. Its frequency across the documents is used to represent each document, as in the vector space model [4], except that the vector space model uses more words than our keywords. The Naïve Bayes classifier [5] is used to classify the documents against the concept node. We chose the Naïve Bayes classifier because its output for each object is the probability that the object belongs to a certain category, which solves the document ranking problem in the retrieval system. For a middle-level concept node, for example rough sets feature, all the keywords of its sub-concepts, rough sets kernel and rough sets parameter, are escalated and combined with the keywords associated with rough sets feature. The Naïve Bayes classifier is then applied to the updated document representation. In this way, the whole document set is classified according to the concept hierarchy. The classified documents are then used to build the thematic index for convenient retrieval.

V. COMPARISON BETWEEN KEYWORD-BASED SEARCH AND CONCEPT-BASED SEARCH
Fig. 3 gives a screenshot of our information retrieval system. Instructions on how to key in a query are shown under the query bar.

[Fig. 3. User Interface of Information Retrieval System]

Fig. 4 presents a screenshot of the results display. If a query, e.g., algorithm, belongs to more than one concept, a directory list is returned. In this example, the rough set algorithm and the SVM algorithm were returned. The user can click on the one of interest, and the relevant documents are returned as shown in Fig. 5. The documents are displayed in descending order of their associated scores.

[Fig. 4. User Interface for Concept Directory]

[Fig. 5. User Interface for Displaying Results]

To demonstrate the effectiveness of our system compared with conventional keyword-based search, we performed the following experiment. The same words were chosen and searched either as a concept or as a pure keyword. Precision and recall [6] were used to evaluate retrieval effectiveness and accuracy. Table II shows the results.

TABLE II
COMPARISON BETWEEN KEYWORDS-BASED SEARCH AND CONCEPT-BASED SEARCH

                        Concept                Keyword
Query                   Precision   Recall     Precision   Recall
algorithm               14/26       14/25      20/109      20/25
rough set algorithm     8/16        8/15       1/1         1/15
SVM + kernel            13/15       13/15      13/18       13/15
rough set + software    9/14        9/9        0/5         0/9

By comparing the precision and recall, it can be seen that the precision of concept-based search is much better than that of keyword-based search, which means that we can produce more relevant documents within a smaller returned document set. The recall of the two kinds of search also suggests that the accuracy of concept-based search is better than that of keyword-based search.

VI. CONCLUSION AND FUTURE WORKS
With the increase in e-mails, reports, journals, news and memos generated within organizations to support their business operations, there is a need to capture valuable information efficiently and effectively in order to improve competitiveness. Currently available search engines are based on keyword search, which produces many false hits, and rely on web-related heuristics to rank relevance. These disadvantages make them unsuitable for internal information retrieval.

Based on an analysis of the requirements of current organizations, we developed a domain-specific concept-based information retrieval system. The features of the system include:
- Able to perform both context-based search and traditional keyword search;
- Intelligent, in that it makes use of knowledge within a certain application domain;
- Web-based GUI to facilitate internal search;
- Able to be migrated to other domains, provided that the domain knowledge is available;
- Easily customized based on different requirements.

The comparison between concept-based search and keyword-based search shows that our system produces more relevant documents.

Currently, we are developing other text mining modules which are built on top of the current system. They are the document summarization module, which produces a brief description of the retrieved documents, and the document clustering module, which is intended as an alternative to the current text analysis module. We are also seeking to integrate a web crawler into our system so that, in the future, documents can be automatically collected and classified into the repository. This will help organizations expand and update their knowledge base easily.
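
Looking back at the classification step of Section IV, the idea of representing each document by the frequencies of a concept's keywords and ranking documents by class probability can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the paper names only the Naïve Bayes classifier [5], so the multinomial variant with add-one smoothing, the function names, the four-keyword vocabulary and the toy training documents are all hypothetical.

```python
import math
import re


def keyword_counts(text, keywords):
    """Frequency of each (stemmed) keyword or phrase in a document."""
    lowered = text.lower()
    return [len(re.findall(r"\b" + re.escape(k) + r"\b", lowered))
            for k in keywords]


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing (an assumption)."""
    n_features = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_prior = math.log(len(rows) / len(vectors))
        log_probs = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (log_prior, log_probs)
    return model


def posterior(model, vector):
    """Probability of each concept node given a keyword-count vector."""
    scores = {c: prior + sum(x * lp for x, lp in zip(vector, log_probs))
              for c, (prior, log_probs) in model.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def rank_documents(model, docs, concept, keywords):
    """Sort document ids by descending probability of the given concept."""
    scored = [(posterior(model, keyword_counts(t, keywords))[concept], d)
              for d, t in docs.items()]
    return [d for _, d in sorted(scored, reverse=True)]


# Hypothetical keywords and training documents, for illustration only.
KEYWORDS = ["rough set", "indiscern", "kernel", "margin"]
TRAIN = [("rough set indiscern rough set", "rough_sets"),
         ("indiscern rough set", "rough_sets"),
         ("kernel margin kernel", "svm"),
         ("margin kernel", "svm")]
MODEL = train_nb([keyword_counts(t, KEYWORDS) for t, _ in TRAIN],
                 [y for _, y in TRAIN])

probs = posterior(MODEL, keyword_counts("rough set kernel rough set", KEYWORDS))
print(probs)
```

Because the classifier returns a probability per concept node, the same number can serve as the document score used for ordering results, which is the reason the text gives for choosing this classifier.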

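The precision and recall figures of Table II can be recomputed from the reported fractions with a few lines of Python. This is only a sketch for checking the arithmetic: the dictionary layout and function names are hypothetical, and the row labels follow our reading of the flattened table (in particular "rough set algorithm" and "SVM + kernel"), so they should be treated as assumptions.

```python
from fractions import Fraction

# (relevant documents retrieved, documents retrieved, relevant documents),
# transcribed from the fractions reported in Table II.
TABLE_II = {
    "algorithm": {"concept": (14, 26, 25), "keyword": (20, 109, 25)},
    "rough set algorithm": {"concept": (8, 16, 15), "keyword": (1, 1, 15)},
    "SVM + kernel": {"concept": (13, 15, 15), "keyword": (13, 18, 15)},
    "rough set + software": {"concept": (9, 14, 9), "keyword": (0, 5, 9)},
}


def precision(hits, retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    return Fraction(hits, retrieved) if retrieved else Fraction(0)


def recall(hits, retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    return Fraction(hits, relevant) if relevant else Fraction(0)


for query, modes in TABLE_II.items():
    for mode, counts in modes.items():
        p, r = precision(*counts), recall(*counts)
        print(f"{query:22s} {mode:8s} "
              f"precision={float(p):.3f} recall={float(r):.3f}")
```

On these figures, concept search has the higher precision for three of the four queries (the exception is rough set algorithm, where keyword search returned a single, relevant document), and its recall is equal or higher for every query except algorithm.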
REFERENCES
[1] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications, in Proc. 40th Anniversary Meeting of the Association for Computational Linguistics, ACL'02, Philadelphia, July 2002.
[2] N. F. Noy and D. L. McGuinness, Ontology Development 101: A Guide to Creating Your First Ontology, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001.
[3] Protégé-2000, Available: http://protege.stanford.edu/
[4] G. Salton, Dynamic document processing, Communications of ACM, vol. 17, no. 7, pp. 658-668, 1972.
[5] I. Rish, An empirical study of the Naïve Bayes Classifier, in Proc. IJCAI-01 Workshop on Empirical Methods in AI, pp. 41-46, 2001.
[6] M. Steinbach, G. Karypis and V. Kumar, A Comparison of Document Clustering Techniques, Technical Report #00-034, Department of Computer Science and Engineering, University of Minnesota, 2000.

International Engineering Management Conference 2004, pp. 525-529