INFO 4300 / CS4300 Information Retrieval
Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 2: The term vocabulary and postings lists
Paul Ginsparg, Cornell University, Ithaca, NY
1 Sep 2009

Administrativa
Course Webpage: http://www.infosci.cornell.edu/courses/info4300/2009fa/
Lectures: Tuesday and Thursday 11:40-12:55, Hollister B14
Instructor: Paul Ginsparg, ginsparg@cornell.edu, 255-7371, Cornell Information Science, 301 College Avenue
Instructor's Assistant: Corinne Russell, crussell@cs.cornell.edu, 255-5925, Cornell Information Science, 301 College Avenue
Instructor's Office Hours: Wed 1-2pm or e-mail instructor to schedule an appointment
Teaching Assistant: TBD, @cornell.edu
The Teaching Assistant does not have scheduled office hours but is available to help you by email.

More Administrativa
(Omitted last time, and no one asked until afterwards...)
During this course there will be four assignments which require programming (due last year 21 Sep, 11 Oct, 9 Nov, 5 Dec), and two examinations: a mid-term and a final.
The course grade will be based on course assignments, participation in the discussions, and the examinations.
The weightings given to these components are expected to be as follows (but may be changed): Assignments 40%, Participation 20%, Examinations 40%.
Course text at: http://informationretrieval.org/

Overview
1 Recap
2 The term vocabulary
  General + Non-English
  English

Major Steps
1. Collect documents
2. Tokenize text
3. Linguistic preprocessing
4. Index documents

Inverted index
For each term t, we store a list of all documents that contain t.
Brutus     →  1 2 4 11 31 45 173 174
Caesar     →  1 2 4 5 6 16 57 132 ...
Calpurnia  →  2 31 54 101
(left column: dictionary; right column: postings)

Intersecting two postings lists
Brutus     →  1 2 4 11 31 45 173 174
Calpurnia  →  2 31 54 101
Intersection  =  2 31
Linear in the length of the postings lists.
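A minimal sketch of the linear-time merge behind this slide, assuming both postings lists are sorted by docID. The function name `intersect` is chosen here for illustration and does not come from the slides.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one pointer, which is why the cost is linear in the combined length of the two lists.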

Constructing the inverted index: Sort postings
Left: (term, docID) pairs in order of occurrence. Right: the same pairs sorted by term, then by docID.
term       docID     term       docID
I          1         ambitious  2
did        1         be         2
enact      1         brutus     1
julius     1         brutus     2
caesar     1         caesar     1
I          1         caesar     2
was        1         caesar     2
killed     1         capitol    1
i'         1         did        1
the        1         enact      1
capitol    1         hath       2
brutus     1         I          1
killed     1         I          1
me         1         i'         1
so         2         it         2
let        2         julius     1
it         2         killed     1
be         2         killed     1
with       2         let        2
caesar     2         me         1
the        2         noble      2
noble      2         so         2
brutus     2         the        1
hath       2         the        2
told       2         told       2
you        2         was        1
caesar     2         was        2
was        2         with       2
ambitious  2         you        2
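The sort step can be sketched as follows. The two document strings are reassembled from the unsorted column of the slide, and whitespace splitting plus lowercasing is an assumption standing in for real tokenization; `build_index` is a name invented for this sketch.

```python
def build_index(docs):
    """Sort-based index construction: emit (term, docID) pairs, sort, then group."""
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()  # naive tokenization (assumption)
    ]
    pairs.sort()  # sort by term, then by docID

    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        # pairs arrive sorted, so duplicates of a docID are adjacent
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["caesar"])  # [1, 2]
print(index["hath"])    # [2]
```

Because the pairs are sorted before grouping, every postings list comes out ordered by docID, which is what the linear-time intersection relies on.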

Westlaw: Example queries
Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
Query: "trade secret" /s disclos! /s prevent /s employe!
Information need: Requirements for disabled people to be able to access a workplace
Query: disab! /p access! /s work-site work-place (employment /3 place)
Information need: Cases about a host's responsibility for drunk guests
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Terms and documents
Last lecture: Simple Boolean retrieval system
Our assumptions were:
  We know what a document is.
  We know what a term is.
Both issues can be complex in reality.
We'll look a little bit at what a document is.
But mostly at terms: How do we define and process the vocabulary of terms of a collection?

Parsing a document
Before we can even start worrying about terms, we need to deal with the format and language of each document.
What format is it in? pdf, word, excel, html etc.
What language is it in?
What character set is in use?
Each of these is a classification problem, which we will study later in this course.
Alternative: use heuristics

Format/Language: Complications
A single index usually contains terms of several languages.
Sometimes a document or its components contain multiple languages/formats.
  French email with Spanish pdf attachment
What is the document unit for indexing?
  A file?
  An email?
  An email with 5 attachments?
  A group of files (ppt or latex in HTML)?
Issues with books (again precision vs. recall)

What is a document?
Take-away: potentially non-trivial; in many cases requires some design decisions.

Definitions
Word: A delimited string of characters as it appears in the text.
Term: A normalized word (case, morphology, spelling etc.); an equivalence class of words.
Token: An instance of a word or term occurring in a document.
Type: The same as a term in most cases: an equivalence class of tokens.

Type/token distinction: Example
In June, the dog likes to chase the cat in the barn.
How many word tokens?
How many word types?
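One way to check the counts is a short script. This is a sketch that assumes a token is any maximal run of ASCII letters; the answer for types depends on whether "In" and "in" are folded into one class. The function name is made up for this example.

```python
import re


def count_tokens_and_types(sentence):
    """Return (token count, case-sensitive type count, case-folded type count)."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # assumption: letters only
    types_case_sensitive = set(tokens)
    types_case_folded = {t.lower() for t in tokens}
    return len(tokens), len(types_case_sensitive), len(types_case_folded)


sentence = "In June, the dog likes to chase the cat in the barn."
print(count_tokens_and_types(sentence))  # (12, 10, 9)
```

Under these assumptions there are 12 tokens, and either 10 types (keeping "In" and "in" apart) or 9 types (after case folding).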

Recall: Inverted index construction
Input: Friends, Romans, countrymen. So let it be with Caesar...
Output: friend roman countryman so...
Each token is a candidate for a postings entry.
What are valid tokens to emit?

Why tokenization is difficult even in English
Example: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
Tokenize this sentence:
  O'Neill: neill, oneill, o'neill, o' neill, o neill ?
  aren't: aren't, arent, are n't, aren t ?
These choices determine which Boolean queries will match.
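To make the effect of these choices concrete, here is a sketch of two naive tokenizers applied to the example sentence. Both regular expressions are illustrative assumptions, not the rules of any particular IR system.

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."


def split_on_nonletters(text):
    """Treat every non-letter, including the apostrophe, as a separator."""
    return re.findall(r"[a-z]+", text.lower())


def keep_internal_apostrophes(text):
    """Keep an apostrophe when it sits between two letters."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


print(split_on_nonletters(sentence))
print(keep_internal_apostrophes(sentence))
```

The first tokenizer emits `o`, `neill`, `aren`, `t`, so a query for `o'neill` can only match if the query is split the same way. The second emits `o'neill` and `aren't` as single tokens, so a query for `neill` alone no longer matches.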

One word or two? (or several)
Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Numbers
3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333
Older IR systems may not index numbers, but generally it's a useful feature.

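Chinese: No whitespace
莎拉波娃现在居住在美国东南部的佛罗里达。今年4月9日，莎拉波娃在美国第一大城市纽约度过了18岁生日。生日派对上，莎拉波娃露出了甜美的微笑。
(Gloss: Sharapova now lives in Florida in the southeastern United States. On April 9 this year she celebrated her 18th birthday in New York, the largest city in the US. At the birthday party, Sharapova gave a sweet smile.)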

Ambiguous segmentation in Chinese
和尚
The two characters can be treated as one word meaning "monk" or as a sequence of two words meaning "and" and "still".

Other cases of "no whitespace"
Compounds in Dutch and German
  Computerlinguistik → Computer + Linguistik
  Lebensversicherungsgesellschaftsangestellter → leben + versicherung + gesellschaft + angestellter
Inuit: tusaatsiarunnanngittualuujunga (I can't hear very well.)
Swedish, Finnish, Greek, Urdu, many other languages

Japanese
[Japanese sample text: not recoverable from the extraction]
4 different "alphabets": Chinese characters, hiragana syllabary for inflectional endings and function words, katakana syllabary for transcription of foreign words and other uses, and latin.
No spaces (as in Chinese).
End user can express query entirely in hiragana!

Arabic script
ك ت ا ب (separate letters) → كتاب (connected form)
Letters in reading order (right to left on the page): k i t ā b un
/kitābun/ "a book"

Arabic script: Bidirectionality
استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
(The sentence is read from right to left, but the numbers are read from left to right.)
"Algeria achieved its independence in 1962 after 132 years of French occupation."
Bidirectionality is not a problem if text is coded in Unicode.

Accents and diacritics
Accents: résumé vs. resume (simple omission of accent)
Umlauts: Universität vs. Universitaet (substitution with special letter sequence "ae")
Most important criterion: How are users likely to write their queries for these words?
Even in languages that standardly have accents, users often do not type them. (Polish?)
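Both strategies from this slide can be sketched with the Python standard library. The helper names and the umlaut table are assumptions made for illustration; a real system would choose its mapping per language.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (résumé -> resume)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# German-style substitution: umlaut -> base vowel + "e" (illustrative table)
UMLAUT_TABLE = str.maketrans({
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00c4": "Ae",  # Ä
    "\u00d6": "Oe",  # Ö
    "\u00dc": "Ue",  # Ü
})


def expand_umlauts(text):
    """Replace umlauts with the two-letter sequence (Universität -> Universitaet)."""
    # Compose first so that "a" + combining diaeresis is also caught
    return unicodedata.normalize("NFC", text).translate(UMLAUT_TABLE)


print(strip_accents("r\u00e9sum\u00e9"))     # resume
print(expand_umlauts("Universit\u00e4t"))    # Universitaet
```

Whichever mapping is chosen, it has to be applied to both the indexed text and the query, otherwise the two forms will not meet in the dictionary.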

Normalization
Need to normalize terms in indexed text as well as query terms into the same form.
Example: We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
  window → window, windows
  windows → Windows, windows
  Windows → Windows (no expansion)
More powerful, but less efficient
Why don't you want to put window, Window, windows, and Windows in the same equivalence class?
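The two approaches can be contrasted in a short sketch. The expansion table is copied from the slide; the toy index, the docIDs, and the function names are hypothetical.

```python
def equivalence_key(term):
    """Implicit equivalence classing: delete periods so U.S.A. and USA coincide."""
    return term.replace(".", "")


# Asymmetric expansion table from the slide: query term -> index terms to look up
QUERY_EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],  # no expansion
}


def search_with_expansion(query_term, index):
    """Union the postings of every index term the query term expands to."""
    doc_ids = set()
    for variant in QUERY_EXPANSIONS.get(query_term, [query_term]):
        doc_ids.update(index.get(variant, []))
    return sorted(doc_ids)


# Hypothetical index that keeps the three surface forms apart
toy_index = {"window": [1], "windows": [2], "Windows": [3]}

print(equivalence_key("U.S.A.") == equivalence_key("USA"))  # True
print(search_with_expansion("window", toy_index))           # [1, 2]
print(search_with_expansion("Windows", toy_index))          # [3]
```

Expansion costs extra lookups at query time, which is the "less efficient" part, but it lets a query for Windows avoid documents about window panes while a query for window still finds the plural.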

Normalization: Other languages
Normalization and language detection interact.
PETER WILL NICHT MIT. → MIT = mit
He got his PhD from MIT. → MIT ≠ mit

Case folding
Reduce all letters to lower case
Possible exceptions: capitalized words in mid-sentence
  MIT vs. mit
  Fed vs. fed
It's often best to lowercase everything since users will use lowercase regardless of correct capitalization.

Stop words
Stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Stop word elimination used to be standard in older IR systems.
But you need stop words for phrase queries, e.g. "King of Denmark"
Most web search engines index stop words.
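A small sketch of stop word elimination and of why it hurts phrase queries. The stop list is the one on the slide; the function name is chosen for this example.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("King of Denmark".split()))  # ['King', 'Denmark']
```

Once "of" is dropped at indexing time, the phrase "King of Denmark" can no longer be told apart from "King Denmark" or "King in Denmark".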

More equivalence classing
Soundex: phonetic equivalence, Muller = Mueller
Thesauri: semantic equivalence, car = automobile

Lemmatization
Reduce inflectional/variant forms to base form
Example: am, are, is → be
Example: car, cars, car's, cars' → car
Example: the boy's cars are different colors → the boy car be different color
Lemmatization implies doing "proper" reduction to dictionary headword form (the lemma).
Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy)

Stemming
Definition of stemming: Crude heuristic process that chops off the ends of words in the hope of achieving what "principled" lemmatization attempts to do with a lot of linguistic knowledge.
Language dependent
Often inflectional and derivational
Example for derivational: automate, automatic, automation all reduce to automat

Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Conventions + 5 phases of reductions
Phases are applied sequentially
Each phase consists of a set of commands.
  Sample command: Delete final "ement" if what remains is longer than 1 character
    replacement → replac
    cement → cement
Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Porter stemmer: A few rules
Rule         Example
SSES → SS    caresses → caress
IES  → I     ponies → poni
SS   → SS    caress → caress
S    →       cats → cat
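The four rules above, together with the longest-suffix convention and the sample "ement" command, can be sketched as below. This covers only these rules and is not a full Porter stemmer; the function names are invented for the sketch.

```python
# (suffix, replacement) pairs, ordered longest suffix first so that the
# first match is also the longest match
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_plural_rules(word):
    """Apply the single rule whose suffix is the longest one that matches."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def delete_ement(word):
    """Delete final 'ement' if what remains is longer than 1 character."""
    if word.endswith("ement") and len(word) - len("ement") > 1:
        return word[: -len("ement")]
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_plural_rules(w))
print(delete_ement("replacement"), delete_ement("cement"))  # replac cement
```

The SS → SS rule looks like a no-op, but because it matches before the shorter S rule it protects words such as "caress" from losing their final letter.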

Three stemmers: A comparison
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Does stemming improve effectiveness?
In general, stemming increases effectiveness for some queries, and decreases effectiveness for others.
Porter Stemmer equivalence class "oper" contains all of: operate, operating, operates, operation, operative, operatives, operational.
Queries where stemming hurts: "operational AND research", "operating AND system", "operative AND dentistry"

What does Google do?
Stop words
Normalization
Tokenization
Lowercasing
Stemming
Non-latin alphabets
Umlauts
Compounds
Numbers