Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Similar documents
Information Retrieval. Information Retrieval and Web Search

Information Retrieval and Web Search

CS 6320 Natural Language Processing

CS105 Introduction to Information Retrieval

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

60-538: Information Retrieval

Information Retrieval

Information Retrieval

Information Retrieval

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

Information Retrieval

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

Information Retrieval

Introduction to Information Retrieval

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction to Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval and Text Mining

Classic IR Models 5/6/2012 1

Introduction to Information Retrieval

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

Chapter 6: Information Retrieval and Web Search. An introduction

Part I: Data Mining Foundations

1Boolean retrieval. information retrieval. term search is quite ambiguous, but in context we use the two synonymously.

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Advanced Retrieval Information Analysis Boolean Retrieval

Introduction to Information Retrieval IIR 1: Boolean Retrieval

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Information Retrieval

Faculty Pages on the RCC Website

Introduction to Information Retrieval

Information Retrieval. (M&S Ch 15)

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Lecture 1: Introduction and the Boolean Model

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Information Retrieval

Building Search Applications

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval

Search: the beginning. Nisheeth

Chapter 2. Architecture of a Search Engine

Information Retrieval. Chap 8. Inverted Files

Introduction to Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

UNICAL, 21/10/2004. Tutorial goals

Information Retrieval CSCI

CS 572: Information Retrieval. Lecture 2: Hello World! (of Text Search)

Lecture 1: Introduction and Overview

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Introduction to Information Retrieval

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

CSCI 5417 Information Retrieval Systems Jim Martin!

Information Retrieval

Text Mining. Representation of Text Documents

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Information Retrieval Spring Web retrieval

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Recap: lecture 2 CS276A Information Retrieval

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Information Retrieval

Question Answering Systems

WordNet-based User Profiles for Semantic Personalization

Information Retrieval

Information Technology for Documentary Data Representation

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Information Retrieval

Outline of the course

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Search Engines Information Retrieval in Practice

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Information Retrieval

Text Mining: A Burgeoning technology for knowledge extraction

PV211: Introduction to Information Retrieval

Part 2: Boolean Retrieval Francesco Ricci

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

Definitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval (Part 1)

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Information Retrieval and Knowledge Organisation

TEXT MINING APPLICATION PROGRAMMING

CS60092: Informa0on Retrieval. Sourangshu Bha<acharya

Unstructured Data. CS102 Winter 2019

Information Retrieval

Transcription:

Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Information Retrieval Information Retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections. Examples of large collections and informations needs: 1) Large corpus of literary texts: Find Shakespeare plays that talk about the meaning of life. 2) World Wide Web: Find affordable hotels on the beach in Destin, Florida. 3) My computer: Find files that contain the words information retrieval. 2

Typical IR task Input: A large collection of unstructured text documents. A user query expressed as text. Output: A ranked list of documents that are relevant to the query Document corpus Query String IR System Ranked Documents 1. Doc1 2. Doc2 3. Doc3.. 3

IR on a Large Text Corpus 1. Find Shakespeare plays that talk about the meaning of life : Information Need expressed as a string Query: Boolean: Naïve: meaning AND life Better: (meaning OR signify) AND life Phrase: the meaning of life Proximity: meaning NEAR life Keywords: meaning life Material of an unstructured nature: text documents (plays). 4

IR on the Web (Web Search) Find affordable hotels on the beach in Destin, Florida : Information Need, typically expressed as a keyword query: Keywords: 3 star hotel on the beach in Destin FL. Material of an unstructured nature: Text (unstructured) HTML (semistructured). Exploit the HTML structure. Exploit the link structure of the Web (PageRank, HITS). 5

IR on My Computer (Personal IR) Find files that contain the words Information Retrieval : Information Need, typically expressed as a keyword query: Keywords: information retrieval Interpreted as a conjunctive Boolean query in MS Vista Instant Search and Mac OS X Spotlight:» Boolean: information AND retrieval Material of an unstructured nature: Need to handle a broad range of documents types: Text, HTML, XML, PDF, ODT, DOCX, PPTX, 6

Information Retrieval vs. Database Search Information Retrieval: Finding information in unstructured repositories (text). Queries: Boolean, keyword, phrase, proximity, 3 star hotel on the beach in Destin FL Database Search: Finding information stored in structured repositories (relational databases, graph databases, etc.). Queries: SQL, SPARQL, RPQ, Cypher, SELECT * FROM Book WHERE price > 100 ORDER BY title; 7

(Semi)Structured Information Retrieval (Semi)Structured IR: find information in text with markup: Queries combine textual criteria with structural criteria: Digital libraries: give me a full-length article on fast fourier transforms Patent DBs: give me patents whose claims mention RSA public key encryption and that cite US patent 4,405,829. Entity-tagged text: give me articles about sightseeing tours of the Vatican and the Coliseum. Markup languages: HTML, XML, ODT (OpenOffice), 8

Typical IR task Input: A large collection of unstructured text documents. A user query expressed as text. Output: A ranked list of documents that are relevant to the query Document corpus Query String IR System Ranked Documents 1. Doc1 2. Doc2 3. Doc3.. 9

Relevance Relevance is a subjective judgment and may include: Being on the subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the user s information need i.e. his/her goals and intended use of the information. Find Shakespeare plays that talk about the meaning of life. Typically expressed as a Query String: meaning of life 10

From Queries to Relevant Documents Phrase Queries: Simplest notion of relevance is that the query string appears verbatim in the document. meaning of life Keyword Queries: Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag-of-words). meaning life 11

Find Shakespeare plays that talk about the meaning of life Keyword Query: meaning life Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded time; And all our yesterdays have lighted fools The way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player That struts and frets his hour upon the stage And then is heard no more. It is a tale Told by an idiot, full of sound and fury Signifying nothing. Skakespeare s Macbeth (Act 5, Scene 5, lines 17-28) 12

Find Shakespeare plays that talk about the meaning of life Keyword Query: meaning life Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded time; And all our yesterdays have lighted fools The way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player That struts and frets his hour upon the stage => need to bridge the Lexical Gap And then is heard no more. It is a tale Told by an idiot, full of sound and fury Signifying nothing. Skakespeare s Macbeth (Act 5, Scene 5, lines 17-28) 13

Find Shakespeare plays that talk about the meaning of life Boolean Query: (meaning OR signify) AND life Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded time; And all our yesterdays have lighted fools The way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player That struts and frets his hour upon the stage And then is heard no more. It is a tale Told by an idiot, full of sound and fury Signifying nothing. Skakespeare s Macbeth (Act 5, Scene 5, lines 17-28) 14

From Information Retrieval (IR) to Question Answering (QA) Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded time; And all our yesterdays have lighted fools The way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player That struts and frets his hour upon the stage And then is heard no more. It is a tale Told by an idiot, full of sound and fury Signifying nothing. Skakespeare s Macbeth (Act 5, Scene 5, lines 17-28) 15

From Information Retrieval (IR) to Question Answering (QA) Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded Q: What time; is the meaning of life? And all our yesterdays have A: lighted Nothing! fools The way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player That struts and frets his hour upon the stage And then is heard no more. It is a tale Told by an idiot, full of sound and fury Signifying nothing. Skakespeare s Macbeth (Act 5, Scene 5, lines 17-28) 16

Question Answering vs. Information Retrieval QA enables users to express information needs through questions in natural language. Answer in QA is focused, typically a noun phrase for factual QA. Answer in IR is a ranked list of relevant documents. QA needs deeper linguistic processing of the text more difficult than classical keyword based IR: Coreference Resolution. Syntactic/Dependency Parsing. Word Sense Disambiguation. 17

Problems with Simple Keyword-based IR May not retrieve relevant documents that include synonymous terms. meaning vs. signifying In this course: FL vs. Florida We will cover the basics of keyword-based IR. May retrieve irrelevant documents that include polysemous terms. Also address more complex techniques for intelligent IR. Python (baseball vs. mammal) Apple (company vs. fruit) play (theater play vs. act of playing) 18

Intelligent IR Take into account the meaning of the words used. Take into account the order of words in the query. Adapt to the user based on automatic or semi-automatic feedback. Expand search query with related terms. Perform automatic spell checking / diacritics restoration. Take into account the authority of the source. 19

Classic IR Models Each document represented by a set of representative keywords or index terms. An index term is a document word useful for remembering the document main themes. Index terms may be selected to be only nouns, since nouns have meaning by themselves: Should reduce the size of the index.... But it requires the identification of nouns Part of Speech tagger However, search engines assume that all words are index terms (full text representation). 20

Classic IR Models Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents The importance of the index terms is represented by weights associated to them. Let: k i be an index term d j be a document w ij is a weight associated with (k i,d j ) The weight w ij quantifies the importance of the index term for describing the document contents. 21

IR System Components Text Operations form index words (tokens). Tokenization. Stopword removal. Stemming. Indexing constructs an inverted index of word to document pointers. Mapping from tokens to document IDs. Doc 1 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious 22

IR System Components Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric. User Interface manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results. Query Operations transform the query to improve retrieval: Query expansion using a thesaurus. Query transformation using relevance feedback. 23

Relevant Disciplines Natural Language Processing: Tokenization & Stemming. Part-Of-Speech (POS) tagging. Syntactic Parsing, Word Sense Disambiguation, Information Extraction, Artificial Intelligence: Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: First-order Predicate Logic. Bayesian Networks. 24

Relevant Disciplines Machine Learning: Text Categorization: Automatic hierarchical classification (Yahoo). Adaptive filtering/routing/recommending. Automated spam filtering. Text Clustering: Clustering of IR query results. Automatic formation of hierarchies (Yahoo). Learning to rank relevant documents. Learning models for basically any relevant NLP task: Tokenization, POS tagging, syntactic parsing, WSD, 25

Relevant Disciplines Linear Algebra: Vector Space Models. Latent Semantic Indexing. Link Analysis. Probability and Statistics: Probabilistic IR. Language Models for IR. Link Analysis. 26

Course Topics (Tentative) 1. Classical IR models: Boolean & Vector Space Models. Text operations & Indexing 2. Probabilistic IR. 3. Language Models for IR. 4. Evaluation of IR performance. 5. Relevance feedback and query expansion. 6. Web Search: Web crawling. Link analysis (PageRank, Hubs and Authorities). 27

Course Topics (Tentative) 7. Text Classification and Clustering. 8. Personalized IR. 9. Question Answering. o o Tutorials: Python & NLTK. Background: Linear Algebra, Probability and Statistics. 28