Introduction to Information Retrieval. Lecture Outline

Similar documents
Modern Information Retrieval

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 2. Architecture of a Search Engine

Chapter 27 Introduction to Information Retrieval and Web Search

Information Retrieval

60-538: Information Retrieval

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

Searching PsycInfo & Proquest Psychology

modern database systems lecture 4 : information retrieval

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1 st Semester, 2018/2019)

CHAPTER 8 Multimedia Information Retrieval

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey

Introduction to Information Retrieval

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1 st Semester, 2018/2019)

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

> Semantic Web Use Cases and Case Studies

ICTR UW Institute of Clinical and Translational Research. i2b2 User Guide. Version 1.0 Updated 9/11/2017

Information Retrieval and Web Search

Information Retrieval

ISSUES IN INFORMATION RETRIEVAL Brian Vickery. Presentation at ISKO meeting on June 26, 2008 At University College, London

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

SEARCH TECHNIQUES: BASIC AND ADVANCED

Table of contents for The organization of information / Arlene G. Taylor and Daniel N. Joudrey.

DATABASE SEARCHING. Instructional guide

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

i2b2 User Guide University of Minnesota Clinical and Translational Science Institute

How to use indexing languages in searching

Introduction to Information Retrieval

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

The Performance Study of Hyper Textual Medium Size Web Search Engine

Introduction to Ovid. As a Clinical Librarían tool! Masoud Mohammadi Golestan University of Medical Sciences

CABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE.

Query Phrase Expansion using Wikipedia for Patent Class Search

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Developing a Test Collection for the Evaluation of Integrated Search Lykke, Marianne; Larsen, Birger; Lund, Haakon; Ingwersen, Peter

Contents. About This Book...v Audience... v Prerequisites... v Conventions... v

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94

1. Accessing EBSCOhost. 2. Why Choose EBSCOhost? 3. EBSCOhost Databases

Information Retrieval. (M&S Ch 15)

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Building Search Applications

Midterm Exam Search Engines ( / ) October 20, 2015

Information Retrieval

Simple Searching with CAB Direct

HELP DOCUMENTATION A User s Guide

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

DATA MINING - 1DL105, 1DL111

1/19/2012. Finish Chapter 1. Workers behind the Scene. CS 440: Database Management Systems

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Web of Science 8.0. Alain Frey, Customer Education and Support

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

IBE101: Introduction to Information Architecture. Hans Fredrik Nordhaug 2008

MLA International Bibliography

Definitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

NNMI140 - Network Node Manager i Software 9.x for Operators

DCMI Abstract Model - DRAFT Update

Enhancing E-Journal Access In A Digital Work Environment

Searching For Healthcare Information

COSC 431 Information Retrieval. Phrase Search & Structured Search

A Model for Interactive Web Information Retrieval

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

How to Build a Digital Library

An Introduction to PubMed Searching: A Reference Guide

Information Retrieval

Information Retrieval Overview

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Finding Images in an OPAC: Analysis of User Queries & Subject Search for Images

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

SwetsWise End User Guide. Contents. Introduction 3. Entering the platform 5. Getting to know the interface 7. Your profile 8. Searching for content 9

Information Retrieval: Retrieval Models

INFORMATION RETRIEVAL SYSTEMS: Theory and Implementation

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Table of Contents. Page 1 of 51

Informatics 1: Data & Analysis

Searching for Medical Literature

The Functional Extension Parser (FEP) A Document Understanding Platform

Step 2: Search & document. 2-1,2,3 Plan search 2-4 Develop main search 2-5 Save & document 2-6 Translate search

ResPubliQA 2010

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Chapter 3 - Text. Management and Retrieval

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Natural Language Processing

Using the THE U of T LIBRARY CATALOGUE to find books & journals

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval

NHS e-referral Service

i2b2 User Guide Informatics for Integrating Biology & the Bedside Version 1.0 October 2012

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS

UNIMARC and XML at the BN. Nuno Freire José Borbinha Hugo Manguinhas INESC-ID

EBSCO Searching Tips User Guide. support.ebsco.com

Information Retrieval

Longitudinal Linkage of Cross-Sectional NCDS Data Files Using SPSS

Transcription:

Introduction to Information Retrieval Lecture 1 CS 410/510 Information Retrieval on the Internet Lecture Outline IR systems Overview IR systems vs. DBMS Types, facets of interest User tasks Document representations Queries Retrieval Evaluation 1

Information retrieval Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. - Baeza-Yates and Berthier Ribeiro-Neto in Modern Information Retrieval, p. 1 Information retrieval is often regarded as being synonymous with document retrieval and, nowadays, with text retrieval, implying that the task of an IR system is to retrieve documents or texts with information content that is relevant to a user s information need... the approaches that have been developed for this purpose are also applicable to a whole family of related information processing tasks that lie between, on the one hand, data retrieval and, on the other, fact or knowledge retrieval. - Sparck Jones and Willett in Readings in Information Retrieval, p. 1 Information Retrieval History Early work in IR in the 50 s and 60 s Roots in library science much older, e.g. Dewey Decimal system 1870s Library of Congress Classification 1890s Important Components: Indexing Searching User-system interaction 2

Typical information retrieval (IR) system Should I use penicillin to treat this patient with endocarditis? User: Information need User: Determines relevance formulates treat endocarditis penicillin Query: Syntax applied to returned to 1 /\/\/\/\/\/\/\/\/\/\ 2 \/\/\/\/\/\/\/\/\/\ 3 /\/\/\/\/\/\/\/\/\/\ 4 /\/\/\/\/\/\/\/\/\/\ 5 /\/\/\/\/\/\/\/\/\/\ matches, ranks, Retrieval set: Order returns represented in Index: Language Documents IR System IR systems vs. DBMS DBMS IR System Target Queries / Query language Matching Results 3

General types of IR systems Web Full text documents Bibliographic Distributed variations Metasearch Virtual document collections 4

5

6

Facets of interest in IR systems Site single controlled website (e.g. intranet) distributed across multiple sites Facets of interest in IR systems Documents Hyperlinked? Format? HTML PDF Word processed Scanned OCR? Type? Text? Multimedia? Semistructured (XML)? Static or dynamic? 7

Facets of interest in IR systems Collection Closed or open? Curated? Static or constantly changing? Content? specific domain? general/any domain? Purpose? Audience? Access? (restricted or not) Facets of interest in IR systems Search engine Supported tasks: search, browse Basic model: Boolean vs. ranked results Indexing language: controlled keywords vs. natural language Indexing target: bibliographic data vs. full text Search syntax; available operations 8

User Tasks Mode Search (retrieve) Browse Purpose Overview Question answering/fact finding Comprehensive research Finding known item (document, page, or site) Transaction (e.g. buy a book, download a file) User Tasks Examples of queries to: Find an overview? Find a fact/answer a question? Find comprehensive information? Find a known item (document, page, or site)? Find a site to execute a transaction (e.g. buy a book, download a file)? 9

Document Representation: Logical Goal: Represent the content Represent other aspects (sometimes) Methods: Assign descriptors (usually selected from a predetermined list) Extract features (usually words or phrases if text document) Other (e.g., try to anticipate what questions the document could answer) Document Representation: Logical History Early systems mostly bibliographic Systems contained brief surrogates for documents Title, author, abstract, location of full document Remember card catalogues?? Provided reference to the full document Limited by available storage and processing power Affected the way document content was represented 10

Document Representation: Logical Methods: Manual usually assign terms from a controlled vocabulary Automatic usually extract terms from the document Considerations: Size of the representation Improve likelihood of appropriate matching to queries Document Representation: Logical Abstract: The objectives of this study were to determine if (1) children with migraine experience greater sleep disturbances than their siblings, (2) those with more severe migraine have greater levels of sleep disruption, and (3) these sleep disturbances lead to greater behavioral problems and more missed school. Children aged 6 to 18 years with a diagnosis of migraine for > 6 months, who had at least one sibling without migraine in the same age range, were identified through our neurology clinic database or at the time of the clinic visit. Parents completed the (1) demographic, general health, and migraine information questionnaire; (2) Child Sleep Habits Questionnaire; and (3) Behavior Assessment System for Children: Second Edition (BASC-2) Parent Rating Scales for each child. Cases with migraine had higher total sleep (P <.02), sleep delay (P <.03), and daytime sleepiness scores (P <.001) than controls. Cases with more severe migraines had higher total sleep (P <.01) and sleep duration scores (P <.03) than those with milder headaches. In cases, higher total sleep... Assigned Indexing Terms: Activities of Daily Living Adolescent Case-Control Studies Child Child Behavior Disorders/*complications/psychology Circadian Rhythm/physiology Female Humans Male Migraine Disorders/*complications/psychology Severity of Illness Index Sleep Disorders/*complications/psychology Sleep Stages/physiology Wakefulness/*physiology 11

Document Representation: Logical Techniques to improve representation Remove stopwords Stemming Use document structure Possible text processing steps Determine what to index (frames? page title? metadata?) Strip formatting (e.g. html tags, w.p. instr.) Recognize structure Normalize representation (Normalize what?) Recognize words (or phrases) Document Representation: Physical Associate document identifier with various descriptors If descriptors are extracted terms, may also include Frequency of terms Position of terms Considerations Efficient storage Representations that can be searched to allow a fast response to user requests 12

Queries User has an information need Information need communicated to IR system as a request Request encoded in a query expression Query expression must be interpretable by the system (query language) Encoded in the interface Query expressions that use symbols to express operators Retrieval Match information request representation (query) to document representations (stored in an index) Use an algorithm to compute the matching Exact match (Boolean expressions) How should results be ordered? Ranked list based on similarity computation (and possibly other factors) 13

IR as Classification Indexing techniques classify documents Keywords, either assigned or extracted, define categories Information requests also identify one or more categories of interest Compare to bricks-and-mortar libraries? Evaluation How can we evaluate performance of an IR system? System perspective User perspective Relevance (How well) does a document satisfy the user s need (or anomalous state of knowledge )? 14

Evaluation: Metrics There are many For now, two basics: Retrieved docs Retr. & Rel. Relevant docs Recall = # documents retrieved and relevant # documents relevant Precision = # documents retrieved and relevant # documents retrieved Questions? Next: Overview of text processing 15