Natural Language Processing and Information Retrieval


Natural Language Processing and Information Retrieval. Indexing and Vector Space Models. Alessandro Moschitti, Department of Computer Science and Information Engineering, University of Trento. Email: moschitti@disi.unitn.it

Outline: preprocessing for inverted index production; vector space models.

Sec. 2.2.2 Stop words. With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: they have little semantic content: the, a, and, to, be. There are a lot of them: ~30% of postings for the top 30 words. But the trend is away from doing this: good compression techniques mean the space for including stop words in a system is very small, and good query optimization techniques mean you pay little at query time for including stop words. You need them for: phrase queries ("King of Denmark"); various song titles, etc. ("Let it be", "To be or not to be"); relational queries ("flights to London").

Sec. 2.2.3 Normalization to terms. We need to normalize words in indexed text as well as query words into the same form: we want to match U.S.A. and USA. The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary. We most commonly implicitly define equivalence classes of terms by, e.g., deleting periods to form a term (U.S.A., USA → USA) or deleting hyphens to form a term (anti-discriminatory, antidiscriminatory → antidiscriminatory).
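The equivalence classing described above can be sketched as a small normalization function. This is a simplified illustration, not a production tokenizer: real systems apply language-specific rules.

```python
def normalize(token: str) -> str:
    """Map a raw token to its equivalence-class representative
    by lowercasing and deleting periods and hyphens (a simplified sketch)."""
    term = token.lower()
    term = term.replace(".", "")   # U.S.A. -> usa
    term = term.replace("-", "")   # anti-discriminatory -> antidiscriminatory
    return term

print(normalize("U.S.A."))               # usa
print(normalize("anti-discriminatory"))  # antidiscriminatory
```

Both surface forms now map to the same dictionary entry, so a query for either retrieves documents containing the other.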

Sec. 2.2.3 Case folding. Reduce all letters to lower case. Exception: upper case in mid-sentence? E.g., General Motors; Fed vs. fed; SAIL vs. sail. Often best to lower-case everything, since users will use lowercase regardless of correct capitalization. Google example: for the query C.A.T., the #1 result was for cat (well, Lolcats), not Caterpillar Inc.

Sec. 2.2.3 Normalization to terms. An alternative to equivalence classing is to do asymmetric expansion. An example of where this may be useful: Enter: window → Search: window, windows. Enter: windows → Search: Windows, windows, window. Enter: Windows → Search: Windows. Potentially more powerful, but less efficient.

Sec. 2.2.4 Lemmatization. Reduce inflectional/variant forms to base form. E.g., am, are, is → be; car, cars, car's, cars' → car; "the boy's cars are different colors" → "the boy car be different color". Lemmatization implies doing proper reduction to dictionary headword form.

Sec. 2.2.4 Stemming. Reduce terms to their roots before indexing. "Stemming" suggests crude affix chopping; it is language dependent. E.g., automate(s), automatic, automation are all reduced to automat. For example, the sentence "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress".

Sec. 2.2.4 Porter's algorithm. Commonest algorithm for stemming English. Results suggest it is at least as good as other stemming options. Conventions + 5 phases of reductions: phases are applied sequentially, and each phase consists of a set of commands. Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Sec. 2.2.4 Typical rules in Porter. sses → ss; ies → i; ational → ate; tional → tion. Some rules are sensitive to the measure of words (m > 1), e.g. the rule (m > 1) EMENT → (empty): replacement → replac, but cement → cement (unchanged).
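The longest-suffix convention can be illustrated with a toy rule applier. This is not the full Porter algorithm (which has five sequential phases plus measure conditions on the stem); it only shows how one compound command picks among its rules.

```python
# Toy illustration of Porter-style suffix rules (not the full algorithm).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix_rule(word: str) -> str:
    # Sample convention: of the applicable rules, select the one
    # that applies to the longest suffix.
    applicable = [(s, r) for s, r in RULES if word.endswith(s)]
    if not applicable:
        return word
    suffix, repl = max(applicable, key=lambda sr: len(sr[0]))
    return word[:-len(suffix)] + repl

print(apply_longest_suffix_rule("caresses"))    # caress
print(apply_longest_suffix_rule("ponies"))      # poni
print(apply_longest_suffix_rule("relational"))  # relate
```

Note that "relational" matches both "tional" and "ational"; the convention selects "ational", giving "relate" rather than "relation".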

Sec. 3.1 Dictionary data structures for inverted indexes. The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what data structure?

Sec. 3.1 A naïve dictionary. An array of struct: char[20] (20 bytes), int (4/8 bytes), Postings* (4/8 bytes). How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
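The fixed-width record on this slide can be made concrete with Python's struct module. This is a sketch assuming a 4-byte int and modeling the postings pointer as a 4-byte offset; actual sizes depend on the platform, which is exactly the 4/8-byte ambiguity the slide notes.

```python
import struct

# Fixed-width record per the slide: char[20] term, int document
# frequency, and a postings "pointer" (here a 4-byte offset).
RECORD = struct.Struct("=20sii")
print(RECORD.size)  # 28 bytes per dictionary entry

rec = RECORD.pack(b"automat".ljust(20, b"\x00"), 42, 0)
term, df, postings_ptr = RECORD.unpack(rec)
print(term.rstrip(b"\x00"), df)  # b'automat' 42
```

The padding makes the waste visible: every term occupies 20 bytes regardless of its length, which motivates the compact dictionary layouts discussed later in the course.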

Sec. 3.1 Dictionary data structures. Two main choices: hashtables and trees. Some IR systems use hashtables, some use trees.

Sec. 3.1 Hashtables. Each vocabulary term is hashed to an integer. (We assume you've seen hashtables before.) Pros: lookup is faster than for a tree: O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search [tolerant retrieval]; if the vocabulary keeps growing, you need to occasionally do the expensive operation of rehashing everything.

Sec. 3.1 Trees: binary tree. [Diagram: a binary tree whose root splits the lexicon into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z.]

Sec. 3.1 Tree: B-tree. [Diagram: a root node with children covering a-hu, hy-m, and n-z.] Definition: every internal node has a number of children in the interval [a, b], where a, b are appropriate natural numbers, e.g., [2, 4].

Sec. 3.1 Trees. Simplest: binary tree. More usual: B-trees. Trees require a standard ordering of characters and hence strings, but we typically have one. Pros: solves the prefix problem (terms starting with hyp). Cons: slower: O(log M) [and this requires a balanced tree]; rebalancing binary trees is expensive, but B-trees mitigate the rebalancing problem.

Sec. 3.2 Wild-card queries: *. mon*: find all docs containing any word beginning with mon. Easy with a binary tree (or B-tree) lexicon: retrieve all words in the range mon ≤ w < moo. *mon: find words ending in mon: harder. Maintain an additional B-tree for terms written backwards; then retrieve all words in the range nom ≤ w < non. Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
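Both range lookups above can be sketched with a sorted list standing in for the B-tree (the sample lexicon is made up for illustration):

```python
from bisect import bisect_left

lexicon = sorted(["moan", "monday", "money", "month", "moon", "salmon", "sermon"])
rev_lexicon = sorted(w[::-1] for w in lexicon)  # backwards terms for *mon

def term_range(sorted_terms, lo, hi):
    """All terms w with lo <= w < hi in a sorted list (B-tree stand-in)."""
    return sorted_terms[bisect_left(sorted_terms, lo):bisect_left(sorted_terms, hi)]

# mon*: all words in range mon <= w < moo
print(term_range(lexicon, "mon", "moo"))  # ['monday', 'money', 'month']

# *mon: range nom <= w < non over the reversed terms, then un-reverse
print([w[::-1] for w in term_range(rev_lexicon, "nom", "non")])  # ['salmon', 'sermon']
```

For the pro*cent exercise, the same machinery applies: intersect the pro* range from the forward tree with the *cent range from the backwards tree.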

Sec. 3.2.2 Bigram (k-gram) indexes. Enumerate all k-grams (sequences of k chars) occurring in any term. E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams): $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$. Here $ is a special word-boundary symbol. Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
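Generating the k-grams and building the bigram-to-terms index can be sketched in a few lines:

```python
from collections import defaultdict

def kgrams(term: str, k: int = 2):
    """k-grams of a term, with $ as the word-boundary symbol."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("april"))  # ['$a', 'ap', 'pr', 'ri', 'il', 'l$']

# Second inverted index: bigram -> dictionary terms containing it
index = defaultdict(set)
for term in ["april", "is", "the", "cruelest", "month"]:
    for g in kgrams(term):
        index[g].add(term)

print(sorted(index["th"]))  # ['month', 'the']
```

A wild-card query such as mon* can then be answered by intersecting the postings of its k-grams ($m, mo, on) and post-filtering the survivors against the pattern.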

Sec. 3.2.2 Bigram index example. The k-gram index finds terms based on a query consisting of k-grams (here k=2). [Diagram: $m → mace, madden; mo → among, amortize; on → along, among.]

SPELLING CORRECTION

Sec. 3.3 Spell correction. Two principal uses: correcting document(s) being indexed, and correcting user queries to retrieve the right answers. Two main flavors: Isolated word: check each word on its own for misspelling; will not catch typos resulting in correctly spelled words, e.g., from → form. Context-sensitive: look at surrounding words, e.g., "I flew form Heathrow to Narita."

Sec. 3.3 Document correction. Especially needed for OCR'ed documents. Correction algorithms are tuned for this: rn/m. Can use domain-specific knowledge: e.g., OCR can confuse O and D more often than it would confuse O and I; but O and I are adjacent on the QWERTY keyboard, so they are more likely interchanged in typing. But also: web pages and even printed material have typos. Goal: the dictionary contains fewer misspellings. But often we don't change the documents and instead fix the query-document mapping.

Sec. 3.3 Query mis-spellings. Our principal focus here. E.g., the query Alanis Morisette. We can either retrieve documents indexed by the correct spelling, OR return several suggested alternative queries with the correct spelling: "Did you mean ...?"

Sec. 3.3.2 Isolated word correction. Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for this: a standard lexicon, such as Webster's English Dictionary or a hand-maintained industry-specific lexicon; or the lexicon of the indexed corpus, e.g., all words on the web, all names, acronyms etc. (including the mis-spellings).

Sec. 3.3.2 Isolated word correction. Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What's "closest"? We'll study several alternatives: edit distance (Levenshtein distance); weighted edit distance; n-gram overlap.

Sec. 3.3.3 Edit distance. Given two strings S1 and S2, the minimum number of operations to convert one to the other. Operations are typically character-level: insert, delete, replace, (transposition). E.g., the edit distance from dof to dog is 1; from cat to act is 2 (just 1 with transpose); from cat to dog is 3. Generally found by dynamic programming. See http://www.merriampark.com/ld.htm for a nice example plus an applet.
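The dynamic-programming solution mentioned above is short enough to write out in full (standard Levenshtein, without the optional transposition operation):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming.
    Operations: insert, delete, replace, each at cost 1 (no transposition)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace or match
    return dp[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2
print(edit_distance("cat", "dog"))  # 3
```

Adding transposition as a fourth operation (Damerau-Levenshtein) would bring cat → act down to 1, as the slide notes.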

Sec. 3.3.3 Weighted edit distance. As above, but the weight of an operation depends on the character(s) involved. Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q; therefore, replacing m by n is a smaller edit distance than replacing it by q. This may be formulated as a probability model. Requires a weight matrix as input. Modify the dynamic programming to handle weights.

Sec. 3.3.4 Using edit distances. Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2). Intersect this set with the list of correct words. Show the terms you found to the user as suggestions. Alternatively: look up all possible corrections in our inverted index and return all docs (slow); or run with a single most likely correction. These alternatives disempower the user, but save a round of interaction with the user.

Sec. 3.3.4 Edit distance to all dictionary terms? Given a (mis-spelled) query, do we compute its edit distance to every dictionary term? Expensive and slow. Alternative? How do we cut down the set of candidate dictionary terms? One possibility is to use n-gram overlap for this. This can also be used by itself for spelling correction.

Sec. 3.3.4 n-gram overlap. Enumerate all the n-grams in the query string as well as in the lexicon. Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams. Threshold by the number of matching n-grams. Variants: weight by keyboard layout, etc.

Sec. 3.3.4 Example with trigrams. Suppose the text is november. Trigrams are nov, ove, vem, emb, mbe, ber. The query is december. Trigrams are dec, ece, cem, emb, mbe, ber. So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?

Sec. 3.3.4 One option: Jaccard coefficient. A commonly used measure of overlap. Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|. It equals 1 when X and Y have the same elements and zero when they are disjoint. X and Y don't have to be of the same size. It always assigns a number between 0 and 1. Now threshold to decide if you have a match, e.g., if J.C. > 0.8, declare a match.
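Applying the Jaccard coefficient to the november/december trigram example above (here without the $ boundary symbols, matching the slide's listing):

```python
def trigrams(term: str) -> set:
    """Set of character trigrams of a term (no boundary padding)."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov & dec))            # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 3))  # 0.333
```

The 3 shared trigrams out of 9 distinct ones give J.C. = 1/3, well below a 0.8 threshold, so november would not be suggested as a correction for december.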

Sec. 3.3.4 Matching bigrams. Consider the query lord: we wish to identify words matching at least 2 of its 3 bigrams (lo, or, rd). Postings: lo → alone, lore, sloth; or → border, lore, morbid; rd → ardent, border, card. A standard postings merge will enumerate the matches. Adapt this to use the Jaccard (or another) measure.
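The thresholded merge can be sketched as a hit-count over the slide's postings lists. This is a simplification: a real postings merge walks the sorted lists in parallel rather than materializing a counter.

```python
from collections import Counter

# Bigram postings from the slide (bigram of query "lord" -> matching terms)
postings = {
    "lo": ["alone", "lore", "sloth"],
    "or": ["border", "lore", "morbid"],
    "rd": ["ardent", "border", "card"],
}

# Count how many of the query's bigrams each term matches,
# then keep terms matching at least 2 of the 3.
counts = Counter(t for plist in postings.values() for t in plist)
matches = sorted(t for t, c in counts.items() if c >= 2)
print(matches)  # ['border', 'lore']
```

Only border (or, rd) and lore (lo, or) survive the threshold; the other candidates match a single bigram each.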

Sec. 3.3.5 Context-sensitive spell correction. Text: "I flew from Heathrow to Narita." Consider the phrase query "flew form Heathrow". We'd like to respond "Did you mean flew from Heathrow?" because no docs matched the query phrase.

Sec. 3.3.5 Context-sensitive correction. Need surrounding context to catch this. First idea: retrieve dictionary terms close (in weighted edit distance) to each query term. Now try all possible resulting phrases with one word fixed at a time: flew from heathrow; fled form heathrow; flea form heathrow. Hit-based spelling correction: suggest the alternative that has lots of hits.
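Varying one word at a time can be sketched as follows. The per-term candidate sets here are hypothetical stand-ins for "dictionary terms within some edit distance"; a real system would generate them with the techniques above.

```python
# Hypothetical correction candidates per query term (illustration only)
candidates = {
    "flew": ["flew", "fled", "flea"],
    "form": ["form", "from", "fort"],
    "heathrow": ["heathrow"],
}
query = ["flew", "form", "heathrow"]

# Substitute one position at a time, keeping the other words fixed
variants = set()
for i, word in enumerate(query):
    for alt in candidates[word]:
        variants.add(" ".join(query[:i] + [alt] + query[i + 1:]))

for phrase in sorted(variants):
    print(phrase)
# Each variant phrase would then be scored by its number of index hits;
# "flew from heathrow" wins because it actually occurs in documents.
```

Note this one-word-at-a-time scheme enumerates far fewer phrases than trying every combination of candidates, which is the point of the exercise on the next slide.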

Sec. 3.3.5 Exercise. Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form, and 3 for heathrow. How many corrected phrases will we enumerate in this scheme?

Sec. 3.3.5 General issues in spell correction. We enumerate multiple alternatives for "Did you mean?" and need to figure out which to present to the user: the alternative hitting the most docs, or one chosen via query log analysis. More generally, rank alternatives probabilistically: argmax_corr P(corr | query). By Bayes' rule, this is equivalent to argmax_corr P(query | corr) * P(corr), where P(query | corr) is the noisy channel model and P(corr) is the language model.

End Lecture