Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Similar documents
Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

CS 6320 Natural Language Processing

Information Retrieval

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Information Retrieval and Knowledge Organisation

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Chapter 2. Architecture of a Search Engine

21. Search Models and UIs for IR

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains

Chapter 27 Introduction to Information Retrieval and Web Search

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

Information Retrieval

What is Text Mining? Sophia Ananiadou National Centre for Text Mining University of Manchester

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

Taxonomies and controlled vocabularies best practices for metadata

SEARCH TECHNIQUES: BASIC AND ADVANCED

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Table of contents for The organization of information / Arlene G. Taylor and Daniel N. Joudrey.

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the

Domain-specific Concept-based Information Retrieval System

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Mining Web Data. Lijun Zhang

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Introduction to Text Mining. Hongning Wang

Overview of Web Mining Techniques and its Application towards Web

DATA MINING II - 1DL460. Spring 2014"

ISSUES IN INFORMATION RETRIEVAL Brian Vickery. Presentation at ISKO meeting on June 26, 2008 At University College, London

Chapter 6: ISAR Systems: Functions and Design

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

The Security Role for Content Analysis

Knowledge Engineering with Semantic Web Technologies

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval (Part 1)

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Definitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy

Terminologies Services Strawman

University of Eastern Finland Library Heikki Laitinen UEF // University of Eastern Finland

Reading group on Ontologies and NLP:

Retrieval Evaluation. Hongning Wang

> Semantic Web Use Cases and Case Studies

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Ontology-based Navigation of Bibliographic Metadata: Example from the Food, Nutrition and Agriculture Journal

CHAPTER-26 Mining Text Databases

Digital Commons Workshop for Depositors

Modern Information Retrieval

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Mining Web Data. Lijun Zhang

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Text Mining. Representation of Text Documents

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

Module 1: Internet Basics for Web Development (II)

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Document Clustering for Mediated Information Access The WebCluster Project

DL User Interfaces. Giuseppe Santucci Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza

A Knowledge-Based Approach to Organizing Retrieved Documents

CHAPTER 8 Multimedia Information Retrieval

ResPubliQA 2010

Data Mining of Web Access Logs Using Classification Techniques

Knowledge Engineering and Data Mining. Knowledge engineering has 6 basic phases:

What is this Song About?: Identification of Keywords in Bollywood Lyrics

Query Refinement and Search Result Presentation

Text Mining: A Burgeoning technology for knowledge extraction

Cross Reference Strategies for Cooperative Modalities

DATA MINING - 1DL105, 1DL111

MeSH: A Thesaurus for PubMed

Creating a Classifier for a Focused Web Crawler

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms

Information Retrieval and Web Search

Introduction to Data Mining and Data Analytics

Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

Metadata Standards and Applications. 6. Vocabularies: Attributes and Values

Information Management (IM)

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

USER SEARCH INTERFACES. Design and Application

Proposal for Implementing Linked Open Data on Libraries Catalogue

Mapping the library future: Subject navigation for today's and tomorrow's library catalogs

Information Retrieval

Domain Specific Search Engine for Students

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

I&R SYSTEMS ON THE INTERNET/INTRANET CITES AS THE TOOL FOR DISTANCE LEARNING. Andrii Donchenko

Sec. 8.7 RESULTS PRESENTATION

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Relevance of Google Customized Search Engine vs. CISMeF Quality- Controlled Health Gateway

Introduction to Information Retrieval. Lecture Outline

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Review on Text Mining

SEARCHING FOR EVIDENCE. Intro to Library Resources

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces

MSc Advanced Computer Science School of Computer Science The University of Manchester

Adding user context to IR test collections

Transcription:

Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1

Acknowledgements This lecture series has been sponsored by the European Community under the BPD program with Vilnius University as host institution 2

Use and Distribution of these Slides These slides are primarily intended for the students in classes I teach. In some cases, I only make PDF versions publicly available. If you would like to get a copy of the originals (Apple KeyNote or Microsoft PowerPoint), please contact me via email at fkurfess@calpoly.edu. I hereby grant permission to use them in educational settings. If you do so, it would be nice to send me an email about it. If you re considering using them in a commercial environment, please contact me first. 3 3

Overview Knowledge Retrieval Finding Out About Keywords and Queries; Documents; Indexing Data Retrieval Access via Address, Field, Name Information Retrieval Access via Content (Values); Parsing; Matching Against Indices; Retrieval Assessment Knowledge Retrieval Access via Structure;Meaning;Context; Usage Knowledge Discovery Data Mining; Rule Extraction 4 4

Finding Out About [Belew 2000] 5 5

Finding Out About Keywords Queries Documents Indexing [Belew 2000] 6 6

Keywords linguistic atoms used to characterize the subject or content of a document words pieces of words (stems) phrases provide the basis for a match between the user s characterization of information need the contents of the document problems ambiguity [Belew 2000] 7 7

Queries formulated in a query language natural language interaction with human information providers artificial language interaction with computers especially search engines vocabulary controlled limited set of keywords may be used uncontrolled any keywords may be used syntax [Belew 2000] 8 8

Documents general interpretation any document that can be represented digitally text, image, music, video, program, etc. practical interpretation passage of text strings of characters in an alphabet written natural language length may vary longer documents may be composed of shorter ones 9 9

Aboutness of Documents describes the suitability of a document as answer to a query assumptions all documents have equal aboutness the probability of any document in a corpus to be considered relevant is equal for all documents simplistic; not valid in reality a paragraph is the smallest unit of text with appreciable aboutness [Belew 2000] 10 10

Structural Aspects of Documents documents may be composed of documents paragraphs, subsections, sections, chapters, parts footnotes, references documents may contain meta-data information about the document not part of the content of the document itself may be used for organization and retrieval purposes can be abused by creators usually to increase the perceived relevance 11 11

Document Proxies surrogates for the real document abridged representations catalog, abstract pointers bibliographical citation, URL different media microfiches digital representations 12 12

Indexing a vocabulary of keywords is assigned to all documents of a corpus an index maps each document doc i to the set of keywords {kw j } it is about Index: doc i about {kw j } Index -1 : {kw j } describes doc i indexing of a document / corpus manual: humans select appropriate keywords automatic: a computer program selects the keywords building the index relation between documents [Belew 2000] 13 13

FOA Conversation Loop [Belew 2000] 14 14

Data Retrieval access to specific data items access via address, field, name typically used in data bases user asks for items with specific features absence or presence of features values system returns data items no irrelevant items deterministic retrieval method 15 15

Information Retrieval (IR) access to documents also referred to as document retrieval access via keywords IR aspects parsing matching against indices retrieval assessment 16 16

Diagram Search Engine [Belew 2000] 17 17

Parsing extraction of lexical features from documents mostly words may require some manipulation of the extracted features e.g. stemming of words used as the basis for automatic compilation of indices [Belew 2000] 18 18

Parsing Tools Montytagger http://web.media.mit.edu/~hugo/ montytagger/ python and Java fntbl (C++) http://nlp.cs.jhu.edu/~rflorian/fntbl/ fast Brill Tagger (C) http://www.cs.jhu.edu/~brill/ the original; influenced several later ones Natural Language Toolkit: http:// nltk.sourceforge.net/ good starting point for basics of NLP algorithms 19 19

Matching Against Indices identification of documents that are relevant for a particular query keywords of the query are compared against the keywords that appear in the document either in the data or meta-data of the document in addition to queries, other features of documents may be used descriptive features provided by the author or cataloger usually meta-data derived features computed from the contents of the document [Belew 2000] 20 20

Vector Space interpretation of the index matrix relates documents and keywords can grow extremely large binary matrix of 100,000 words * 1,000,000 documents sparsely populated: most entries will be 0 can be used to determine similarity of documents overlap in keywords proximity in the (virtual) vector space associative memories can be used as hardware implementation [Belew 2000] 21 21

Vector Space Diagram [Belew 2000] 22 22

Measuring Retrieval ideally, all relevant documents should be retrieved relative to the query posed by the user relative to the set of documents available (corpus) relevance can be subjective precision and recall relevant documents vs. retrieved documents 23 23

Document Retrieval [Belew 2000] 24 24

Precision and Recall recall retrieved relevant / relevant precision retrieved relevant / retrieved [Belew 2000] 25 25

Specificity vs. Exhaustivity [Belew 2000] 26 26

Retrieval Assessment subjective assessment how well do the retrieved documents satisfy the request of the user objective assessment idealized omniscient expert determines the quality of the response [Belew 2000] 27 27

Retrieval Assessment Diagram [Belew 2000] 28 28

Relevance Feedback subjective assessment of retrieval results often used to iteratively improve retrieval results may be collected by the retrieval system for statistical evaluation can be viewed as a variant of object recognition the object to be recognized is the prototypical document the user is looking for this document may or may not exist the difference between the retrieved document(s) and the idealized prototype indicates the quality of the retrieval results [Belew 2000] 29 29

Relevance Feedback in Vector Space relevance feedback is used to move the query towards the cluster of positive documents moving away from bad documents does not necessarily improve the results it can also be used as a filter for a constant stream of documents as in news channels or similar situations [Belew 2000] 30 30

Query Session Example [Belew 2000] 31 31

Consensual Relevance relevance feedback from multiple users identifies documents that many users found useful or interesting used by some Web sites related to collaborative filtering can also be used as an evaluation method for search engines performance criteria must be carefully considered precision and recall, plus many others [Belew 2000] 32 32

IR Diagram Corpus Query Term 1 Term 2 Term 3 Term 4 Keywords Index Documents 33 33

IR Diagram Corpus Query Term 1 Term 2 Term 3 Term 4 Keywords Index Documents 34 34

IR Diagram Corpus Query Term 1 Term 2 Term 3 Term 4 Keywords Index Documents 35 35

IR Diagram Corpus Query Term 1 Term 2 Term 3 Term 4 Keywords Index Documents 36 36

IR Diagram Corpus Query Term 1 Term 2 Term 3 Term 4 Keywords Index Documents 37 37

Knowledge Retrieval Context Usage 38 38

Context in Knowledge Retrieval in addition to keywords, relationships between keywords and documents are exploited spatial: place, directory temporal: creation date/time intermediate relations author/creator organization project explicit links hypertext related concepts thesaurus, ontology proximity 39 39

Inference beyond the Index determines relationships between documents citations are explicit references to relevant documents bibliographic references legal citations hypertext examples NEC CiteSeer <http://citeseer.nj.nec.com> Google Scholar http://scholar.google.com 40 40

Additional Information Sources [Belew 2000, after Kochen 1975] 41 41

Hypertext inter-document links provide explicit relationships between documents can be used to determine the relevance of a document for a query example: Google <http://www.google.com> intra-document links may offer additional context information for some terms footnotes, glossaries, related terms 42 42

Adaptive Retrieval Techniques fine-tuning the matching between queries and retrieved documents learning of relationships between terms training with term pairs (thesaurus) pattern detection in past queries automatic grouping of documents according to common features clustering of similar documents pre-defined categories metadata overlap in keywords consensual relevance source 43 43

Document Classification 44 44

Query Model query types (templates) frequently used types of queries e.g. problem/solution, symptoms/diagnosis, problem/further checks,... category types abstractions of query types used to determine categories or topics for the grouping of search results context information current working document/directory previous queries [Pratt, Hearst, Fagan 2000] 45 45

Terminology Model individual terms are connected to related terms thesaurus/ontology synonyms, super-/sub-classes, related terms identifies labels for the category types [Pratt, Hearst, Fagan 2000] 46 46

Matching categorizer determines the categories to be selected for the grouping of results assigns retrieved documents to the categories organizer arranges categories into a hierarchy should be balanced and easy to browse by the user depends on the distribution of the search results [Pratt, Hearst, Fagan 2000] 47 47

Results retrieved documents are grouped into hierarchically arranged categories meaningful for the user the categories are related to the query the categories are related to each other all categories have similar size not always achievable due to the distribution of documents reduced search times higher user satisfaction [Pratt, Hearst, Fagan 2000] 48 48

DynaCat knowledge-based approach to the organization of search results categorizes results into meaningful groups that correspond to the user s query uses knowledge of query types and of the domain terminology to generate hierarchical categories applied to the domain of medicine MEDLINE is an on-line repository of medical abstracts 9.2 million bibliographic entries from 3800 journals PubMed is a web-based search tool returns titles as an relevance-ranked list [DynaCat, 2000] 49 49

DyanCat Results [DynaCat, 2000] 50 50

DynaCat Query Types [DynaCat, 2000] 51 51

DynaCat Search [DynaCat, 2000] 52 52

Information vs. Knowledge Retrieval IR keywords as main components of the query index as match-making facility statistical basis for selection of relevant documents (ordered) list of results KR keywords plus context information for the query index plus ontology for matching query and documents relationships between keywords and documents influence the selection of relevant documents results are grouped into meaningful categories 53 53

KR Diagram keyword input synonym expansion relation expansion Query Term 3 Term 2 Term 4 Term 1 Keywords Term A Index Documents Corpus Term B Ontology Term C Term D Term E Term F Term H Term K Term I Term J Term L Term G Term M 54 54

Knowledge Discovery combination of Data Mining Knowledge Extraction 55 55

Data Mining identification of interesting nuggets in huge quantities of data often relations between subsets automatic or semi-automatic techniques classification, correlation (e.g. temporal, spatial) 56 56

Knowledge Extraction conversion of internal representations of knowledge into human-understandable format extraction of rules from neural networks is one example 57 57

Summary Knowledge Retrieval identification, selection, and presentation of documents relevant to a user query utilization of structural information, context, meta-data in addition to keyword search organized presentation of results categories, visual arrangement internal representations may be converted to human-understandable ones 58 58

59 59