Information Retrieval (Part 1)

Similar documents
Information Retrieval and Organisation

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1 st Semester, 2018/2019)

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

Document Clustering for Mediated Information Access The WebCluster Project

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

INFORMATION RETRIEVAL SYSTEM USING FUZZY SET THEORY - THE BASIC CONCEPT

60-538: Information Retrieval

CS 6320 Natural Language Processing

Overview MULTIMEDIA INFORMATION RETRIEVAL. Search Engines. Information Retrieval. Explanation. Van Rijsbergen

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Information Retrieval

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:

Introduction to Information Retrieval

Information Retrieval CSCI

Semantic Clickstream Mining

Information Retrieval and Knowledge Organisation

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

DL User Interfaces. Giuseppe Santucci Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza

21. Search Models and UIs for IR

Beyond Content-Based Retrieval:

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Information Retrieval. (M&S Ch 15)

Multimedia Information Systems

Definitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy

CS-490WIR Web Information Retrieval and Management. Luo Si

CS506/606 - Topics in Information Retrieval

Exploiting ERP Systems in Enterprise Search

Searching the Web [Arasu 01]

Information Retrieval

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval and Web Search

Information Retrieval

Extending the Facets concept by applying NLP tools to catalog records of scientific literature

MULTIMEDIA DATABASES OVERVIEW

Search Engines. Information Retrieval in Practice

Processing Structural Constraints

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

5/13/2009. Introduction. Introduction. Introduction. Introduction. Introduction

Retrieval Evaluation

Joining Collaborative and Content-based Filtering

HERA: Automatically Generating Hypermedia Front- Ends for Ad Hoc Data from Heterogeneous and Legacy Information Systems

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Retrieval Evaluation. Hongning Wang

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Chapter 27 Introduction to Information Retrieval and Web Search

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness

CHAPTER-26 Mining Text Databases

Mining for User Navigation Patterns Based on Page Contents

Natural Language Processing

Compound or complex object: a set of files with a hierarchical relationship, associated with a single descriptive metadata record.

Recovering Traceability Links between Code and Documentation

Search Engine Architecture. Hongning Wang

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

Reading group on Ontologies and NLP:

Document Structure Analysis in Associative Patent Retrieval

Web Search: Techniques, algorithms and Aplications. Basic Techniques for Web Search

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012

Information mining and information retrieval : methods and applications

Part 11: Collaborative Filtering. Francesco Ricci

Development of an Ontology-Based Portal for Digital Archive Services

INFSCI 2480 Adaptive Information Systems Adaptive [Web] Search. Peter Brusilovsky.

Information Retrieval

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Information Retrieval

Automatic Construction of News Hypertext. Theodore Dalamagas

Media, Multimedia & Digital Media. Basic Concepts

Introduction to Information Retrieval. Hongning Wang

Cut as a Querying Unit for WWW, Netnews, and

Dynamic Visualization of Hubs and Authorities during Web Search

Improving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

Similarity search in multimedia databases

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Multimodal Information Spaces for Content-based Image Retrieval

Agents and areas of application

DATA MINING II - 1DL460. Spring 2014"

Analysis on the technology improvement of the library network information retrieval efficiency

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

LECTURE 12. Web-Technology

Search Engine Architecture II

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval

Information Technology for Documentary Data Representation

Lecture 1: Introduction and the Boolean Model

Data Modelling and. Multimedia. Databases M. Multimedia. Information Retrieval Part III. Outline

Lecture 1: Introduction and Overview

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces

Transcription:

Information Retrieval (Part 1) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Università di Padova Anno Accademico 2008/2009 1 Bibliographic References Copies of slides Selected scientific papers Suggested Textbooks C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008 (Draft available in the course Web page) Other interesting books in the course Web page 2

A definition of IR [Manning et.al. 2007] Information retrieval (IR) is finding material (usually documents) of unstructured nature (usually text) that satisfy an information need from within large collections (usually on local computer servers or on the internet). 3 1. Finding material (documents) in large collections 2. Unstructured nature as opposed to structured data: data do not have a fixed semantic/structure 3. Satisfy an information need 4

1. Large document base CD-ROM and distributed technology have given rise to larger and larger document bases (e.g. from O(10 6 ) to O(10 9 ) documents). This is the size at which the most of IR problems arise; The document base can be static (e.g. CD- ROM) or dynamic (e.g. digital libraries, the Web), centralised or distributed; 5 2. Unstructured documents Free-form expressions, usually having an information content (in DB terminology: semistructured data) Scientific papers, letters, e-mails, newspaper articles, image captions, audio transcript -> Textual IR Images, graphics, audio (spoken or not spoken), video,, stored in digital form -> Multimedia IR 6

3. Information Need A desire (possibly specified in an imprecise way) of information useful to the solution of the problem, or resources useful to a given goal; Useful (Relevant), according to the subjective opinion of the user. 7 Tipical tasks covered in IR Search ( ad hoc retrieval) Static document collection, Dynamic queries Filtering Static query, Dynamic document feeds Categorization Clustering Collaborative Filtering or Recommendation Browsing Summarization Question Answering 8

Media involved in IR Text monolingual o multilingual text Structured text (XML) OCR->Text Spoken text Hypertext Music Graphics Images Video / Animation 9 The nature of IR Information Retrieval is difficult because of the indeterminancy of relevance: The system might interpret differently from the user the meaning of the documents and/or the query, due to inherent ambiguities in natural language; The user might not know exactly what she wants (vague or imprecise information need); It is not clear what degree of relevance the user is happy with (i.e. it might not be clear to the system whether the user is recall-oriented or precisionoriented ). 10

What IR is not! Data Retrieval, as in DBs Information need cannot directly be expressed as a simple query and documents have not a precise semantic. A translation of them into logical representations is needed. In IR the set of objects to be retrieved are not clearly determined -> Slightly different retrieved sets should not be necessarily considered as a fatal error of the system User satisfaction is the issue of IR Knowledge Retrieval, as in AI In AI a fact α is inferred from a knowledge base Γ of facts expressed in a certain formalism Question Answering In QA a query an answer is returned generated from a semantic analysis of documents. Huge amount of domain knowledge needed Information Browsing, as in Hypermedia Relevant documents are retrieved by an active intervention of the user and not by a search routine The goal of a browsing task is less clear in the mind of the user 11 Evolution of IR In the past, IR systems were used only by expert librarians as reference retrieval systems in batch modality. Many libraries still use categorization hierarchies to classify their volumes The advent of novel computers and the Web have brought to efficient indexes, capable to index and displaying entire documents processing of user queries with high performance ranking algorithms which improve the quality of the answers methods to deal with multimedia interaction with the user methods to deal with distributed document collections (e.g. WWW) 12

A formal characterization An IR model can be defined by M=[D,Q,R] where D is a representation for the documents in the collection Q is a representation for the user information needs (queries) R(d i,q j ) is a ranking function which associates a real number with a query q j Q and a document representation d i D. N.B. It defines an ordering among the documents with regard to a given query q j. 13 Unlike query satisfaction in DBs, the relationship of relevance R between documents D and information needs Q is not formally defined, but is subjective, i.e. determined by the user. Therefore, unlike in DBs, effectiveness is an issue a degree of effectiveness (user satisfaction) can be defined In an IR system or model, it is necessary to choose whether to treat relevance as 1. Boolean-valued R : D Q {0,1} 2. Finite-valued R : D Q {1,..,N} 3. Infinite-valued R : D Q R Approaches 2. and 3. are definitely the most plausible, but approach 1. is the most popular, as it greatly simplifies both relevance feedback and experimentation. 14

Factors which influence the relevance Relevance is an ineffable notion, in particular it is: Subjective: two users may pose the same query and give different judgments on the same retrieved document; Dynamic: A user may judge relevant a given document at a given retrieval pass, and later judge the same document as irrelevant, or viceversa; The documents retrieved and displayed to the user may influence her relevance judgment on the documents that will be displayed to her later; Multifaceted: it is determined not just by topicality (i.e. thematic affinity), but also by authoritativeness, credibility, specificity, exhaustivity, recency, clarity, 15 R is not known to the system prior to user judgment! The system may only take a guess by computing a Retrieval Status Value RSV(d i,q j ) which depends on the model used e.g. Logical consequence in Boolean Models Similarity between vectors in Vector Space Models Probability of relevance in Probabilistic Models The range of RSV should be a totally ordered set to allow for ranking the documents according to the scores. 16