信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1

Last week: We discussed in more detail how indexes are created: tokenization, normalization, lemmatization. Phrase queries using positional indexes. QQ Group: 738927894 Website: PPTs 2

Course schedule ( 日程安排 ), Lectures 1 to 7. Topics: Introduction; Boolean retrieval ( 布尔检索模型 ); Term vocabulary and posting lists; Dictionaries and tolerant retrieval; Index construction and compression; Scoring, weighting, and the vector space model; Computing scores in a complete search system; Evaluation in information retrieval; Web search engines, advanced topics, and conclusion. 3

About last course. Normalization ( 规范化 ): the process of converting tokens to a standard form. Stemming: removing the end of words (simple), e.g. cars → car, airplanes → airplane. Lemmatization: converting a word to a common base form called the lemma (more complicated), e.g. am, are, is → be. 4

CHAPTER 3 DICTIONARIES AND TOLERANT RETRIEVAL PDF p.86-5

Previous weeks: Boolean retrieval model ( 布尔检索模型 ), using Boolean operators, e.g. Shenzhen AND food. Phrase ( 短语 ) queries, e.g. "Airplane tickets from Beijing". Proximity queries, e.g. Shenzhen (within 5 words) of City. To find documents, we have used a dictionary ( 词典 - also called inverted index 倒排索引 ). 6

Today How to deal with typographical errors ( 打字错误 )? Shenzhen vs Shenzhennn often made by accident ( 无意地 ) How to deal with different spellings ( 拼法 )? Color vs Colour analyze vs analyse How to deal with phonetically similar terms ( 发音相似的词 )? concede vs conceed right vs write vs rite vs wright 7

Wildcard queries ( 通配符查询 ) Wildcard query: a query containing the wildcard ( 通配符 ) character *, where * stands for any sequence of zero or more characters, e.g. automat* to search for automated, automation, automata. When should we use wildcard queries? When we want documents containing variants of a query term; when we are uncertain about how to spell a query term, e.g. Sydney vs Sidney. 8

Searching for documents Given A set of documents An inverted index (dictionary 词典 ) A query ( 查询 ) we can search for documents. Several steps for searching 9

How does an IR system answer Boolean queries? Dictionary terms: China, City, Located, Shenzhen. Posting lists shown: Book1, Book 20, Book 34 (China); Book1, Book2, Book 7, Book 20, ... (City); Book1, Book3, Book 5, Book 9, ... The query is: CITY AND CHINA. 1) Locate CITY in the dictionary. 2) Retrieve its postings. 3) Locate CHINA in the dictionary. 4) Retrieve its postings. 5) Compute the intersection ( 交集 ) of the two lists. RESULT: Book 1, Book 20. 14
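
The five steps above can be sketched in a few lines of Python. This is only an illustration: the names `index`, `intersect` and `boolean_and` are assumptions made for the example, and books are represented by their numbers.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Toy inverted index taken from the slide (book numbers only).
index = {
    "china": [1, 20, 34],
    "city": [1, 2, 7, 20],
}

def boolean_and(term1, term2):
    # Steps 1-4: locate each term and retrieve its postings
    # (an unknown term simply has an empty posting list).
    postings1 = index.get(term1, [])
    postings2 = index.get(term2, [])
    # Step 5: intersect the two lists.
    return intersect(postings1, postings2)

print(boolean_and("city", "china"))  # [1, 20]
```

Because both lists are sorted, the intersection needs only one pass over each list.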

How to quickly search terms in a dictionary? There are different approaches. Choosing an approach depends on: the number of terms in the dictionary (few or many?); whether the terms in the dictionary are static or dynamic (are new terms added? are some terms removed?); the relative frequencies ( 相对频率 ) with which terms are accessed (are some terms much more popular than others?). 15

Approach 1: Hashing ( 散列 ) Basic idea: A hash function ( 散列函数 ) is used to associate a positive number with each term of the dictionary. Example: h(shenzhen) = 1246 16

Approach 1: Hashing ( 散列 ) Example: We can define the hash function as the number of letters in a word: h(term) = number of letters, so h(china) = 5, h(shenzhen) = 8, h(city) = 4, h(located) = 7. These numbers are called «hash values» ( 散列值 ). 17

Dictionary. Approach 1: Hashing ( 散列 ) The dictionary is created such that terms are associated with their hash values: 4 → City (Book1, Book2, Book 10, ..., Book20); 5 → China (Book1, Book 20, ...); 7 → Located; 8 → Shenzhen (Book1, Book3). 18

Approach 1: Hashing ( 散列 ) When searching in a dictionary, the hash function is used to quickly find the terms of the query. Dictionary: 4 → City (Book1, Book2, Book 10, ..., Book20); 5 → China (Book1, Book 20, ...); 7 → Located; 8 → Shenzhen (Book1, Book3). 19

Approach 1: Hashing ( 散列 ) Query: City AND Shenzhen. h(city) = 4 and h(shenzhen) = 8, so we look directly at entries 4 and 8 of the dictionary: 4 → City (Book1, Book2, Book 10, ..., Book20); 5 → China (Book1, Book 20, ...); 7 → Located; 8 → Shenzhen (Book1, Book3). Result: Book 1. 23
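
A minimal sketch of this lookup, assuming the toy hash function h(term) = number of letters. The names `table`, `lookup` and `and_query` are illustrative, books are represented by their numbers, and postings elided on the slide are simply left out.

```python
def h(term):
    # Toy hash function from the slides: the number of letters in the term.
    return len(term)

# Postings as far as they are visible on the slide.
postings = {
    "city": [1, 2, 10, 20],
    "china": [1, 20],
    "located": [],          # postings not shown on the slide
    "shenzhen": [1, 3],
}

# Hash table: hash value -> {term: postings}.
table = {}
for term, docs in postings.items():
    table.setdefault(h(term), {})[term] = docs

def lookup(term):
    # Jump directly to the entry given by the hash value,
    # then pick the term inside that entry.
    return table.get(h(term), {}).get(term, [])

def and_query(term1, term2):
    return sorted(set(lookup(term1)) & set(lookup(term2)))

print(h("city"), h("shenzhen"))        # 4 8
print(and_query("city", "shenzhen"))   # [1]
```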

Advantage of Hashing ( 散列 ) Using a hash function ( 散列函数 ) makes searching in a dictionary very fast. By calculating the value of the hash function, we can directly find where a term is located in the dictionary. 24

Problem of Hashing ( 散列 ) However, it is possible that many terms have the same value for the hash function (this is a collision 冲突 ). In this case, the search will still be slow, because all the terms sharing that value must be checked. In our example: most words in English have fewer than 17 letters, so the hash function takes very few distinct values. Thus, there will be many collisions. 25

Dictionary with collisions: City, Maze, Quiz and Jury all have four letters, so all four terms share the hash value 4. 26

Problem of Hashing ( 散列 ) We could reduce that problem by using a better hash function ( 散列函数 ): h(term) = sum of the letters when converted to numbers (a = 1, b = 2, ..., z = 26). h(city) = c + i + t + y = 3 + 9 + 20 + 25 = 57. This works better because terms are less likely to have the same number. 27
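
To make the difference concrete, the sketch below groups the same four-letter words with both hash functions. The function names are assumptions for the example, and the letter-sum function assumes lowercase ASCII letters only.

```python
from collections import defaultdict

def h_length(term):
    # First hash function: number of letters.
    return len(term)

def h_letter_sum(term):
    # Second hash function: a=1, b=2, ..., z=26, summed over the term.
    return sum(ord(c) - ord("a") + 1 for c in term.lower())

def buckets(terms, h):
    """Group terms by hash value; a bucket with several terms is a collision."""
    table = defaultdict(list)
    for t in terms:
        table[h(t)].append(t)
    return dict(table)

terms = ["city", "maze", "quiz", "jury"]
print(buckets(terms, h_length))
# {4: ['city', 'maze', 'quiz', 'jury']}
print(buckets(terms, h_letter_sum))
# {57: ['city'], 45: ['maze'], 73: ['quiz'], 74: ['jury']}
```

The letter sum is still a weak hash function: words made of the same letters, such as act and cat, always collide.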

Problem of Hashing ( 散列 ) There is no simple way of finding variants of the same query term, e.g. resume vs résumé: those two words may not have the same hash value. We also cannot answer wildcard queries, e.g. automat* to search for automated, automation. 28

Approach 2: Search tree ( 搜索树 ) Basic idea: To be able to search quickly, a tree will be used. The terms will be inserted in the tree. The tree will be used to quickly search for the terms. 29

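Illustration of a search tree: the root ( 根节点 ) has two branches, a-m and n-z. The internal nodes ( 内部节点 ) split these ranges further: a-m into a-h and h-m, n-z into n-r and s-z. The terms city, located and shenzhen are stored at the bottom of the tree. 31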

Description of a search tree A search tree is a tree where each node can have several child nodes. To search for a term, we start from the root ( 根节点 ) of the tree. Each internal node ( 内部节点 ) in the tree has a test to decide which child node should be explored. The search ends when the term is found. EXAMPLE 32

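Searching CITY: the search always starts from the root ( 根节点 ) of the tree. The term city begins with c, so we follow the branch a-m, then the branch a-h, and reach the term city. (Tree: root with branches a-m and n-z; a-m splits into a-h and h-m; n-z splits into n-r and s-z; terms: city, located, shenzhen.) 34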

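Searching Shenzhen: starting from the root ( 根节点 ), the term shenzhen begins with s, so we follow the branch n-z, then the branch s-z, and reach the term shenzhen. (Tree: root with branches a-m and n-z; a-m splits into a-h and h-m; n-z splits into n-r and s-z; terms: city, located, shenzhen.) 38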

Approach 2: Search tree ( 搜索树 ) Advantages: Using a search tree ( 搜索树 ) makes it possible to quickly find terms in a dictionary to answer a query. It also makes it possible to find all terms that match a prefix ( 前缀 ), e.g. automat* (a type of wildcard query). 43

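Searching automat*: the search always starts from the root ( 根节点 ) of the tree. We follow the branch a-m, then the branch a-h, and collect every term below it that begins with automat, here automated and automation. (Other terms in the tree: located, shenzhen.) 44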

Technical details There are many types of search trees: binary tree ( 二叉树 ): a tree where each node has no more than two children. B-tree (B 树 ): a type of tree where all branches are equally long. B+ tree (B+ 树 ), etc. We will not discuss these details. 45

How to apply this to Chinese? In English there is an order between letters: A, B, C, ..., X, Y, Z. In Chinese, there is no single standard ordering of the characters; dictionaries can be organized semantically, phonetically (pinyin), by number of strokes, etc. 46

When to use wildcard queries? When the user is uncertain of the spelling of a term, e.g. S*dney for Sydney or Sidney. When the user wants to find variations of the same word, e.g. col*r for color or colour. 47

When to use wildcard queries? When the user wants to find variations of a term, e.g. judic* for judicial or judiciary. When the user wants to find a word that may be written differently in another language, e.g. Universit* of Stuttgart for University, Université, Universität. 48

Trailing wildcard queries Trailing wildcard query: the * symbol appears at the end of a term. automat* judic* These queries can be easily handled using a search tree with a dictionary. 49
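
As a rough sketch of why trailing wildcards are easy: in any sorted structure, all terms sharing a prefix are stored next to each other. The example below uses a sorted Python list with binary search as a stand-in for the search tree (this substitution and the name `prefix_range` are assumptions of the example, not the structure used in a real system).

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms of a sorted list that start with `prefix`."""
    # Binary search for the first term >= prefix ...
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    # ... then read terms in order until the prefix no longer matches.
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

vocabulary = sorted([
    "automata", "automated", "automation", "city",
    "judicial", "judiciary", "located", "shenzhen",
])

print(prefix_range(vocabulary, "automat"))  # ['automata', 'automated', 'automation']
print(prefix_range(vocabulary, "judic"))    # ['judicial', 'judiciary']
```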

Leading wildcard queries Leading wildcard query: the * symbol appears at the beginning of a term, e.g. *mobile matches automobile, mobile, immobile. How to handle these queries? Solution: use a reverse search tree, where the terms are read backward. Thus we use two trees: one for trailing wildcard queries and one for leading wildcard queries. 50

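Reverse search tree: searching CITY. The terms are read backward, so city is stored as ytic. The search always starts from the root of the tree: ytic begins with y, so we follow the branch n-z, then the branch t-z, and reach the term city. (Tree: root with branches a-m and n-z; a-m splits into a-h and h-m; n-z splits into n-s and t-z; terms: located, shenzhen, city.) 51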

Other wildcard queries? But what if the wildcard * is not at the end or beginning of a term, e.g. S*dney? We would like to handle queries where the * symbol can appear anywhere in a term. 53

Queries with one wildcard (*) Using a search tree and a reverse search tree, an IR system can answer any query containing one wildcard (*). How? Example: S*dney. 1) Use the search tree to find all terms starting with S. 2) Use the reverse search tree to find all terms ending with dney. 3) Calculate the intersection of these two sets of terms. 4) Then, find the documents corresponding to these terms in the dictionary as usual. 54

Words that start with S: Sidney, Shanghai, Shenzhen. Words that end with dney: Kidney, Sidney. Words that match S*dney: Sidney. 55
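
The S*dney example can be sketched as follows. Sorted lists of the terms and of the reversed terms stand in for the search tree and the reverse search tree; the names `forward`, `backward` and `one_wildcard` are assumptions made for this illustration.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms of a sorted list that start with `prefix`."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

vocabulary = ["kidney", "shanghai", "shenzhen", "sidney"]
forward = sorted(vocabulary)                      # stand-in for the search tree
backward = sorted(t[::-1] for t in vocabulary)    # stand-in for the reverse tree

def one_wildcard(query):
    # Assumes the query contains exactly one '*'.
    prefix, suffix = query.split("*")
    starts = set(prefix_range(forward, prefix))
    # Look up the reversed suffix, then turn the hits back around.
    ends = {t[::-1] for t in prefix_range(backward, suffix[::-1])}
    # The length check rejects terms where prefix and suffix would overlap
    # (e.g. 'aba' must not match 'ab*ba').
    return sorted(t for t in starts & ends
                  if len(t) >= len(prefix) + len(suffix))

print(one_wildcard("s*dney"))  # ['sidney']
print(one_wildcard("*dney"))   # ['kidney', 'sidney']
```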

General wildcard queries General wildcard query: a query containing one or more wildcards (*), e.g. transf*mat* or *an*. How to answer these queries? There are two techniques. 56

Permuterm indexes The permuterm index is a special type of dictionary (which is also called inverted index). A special symbol $ is used to indicate the end of each term, e.g. hello$, Shenzhen$, Beijing$. 57

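In a permuterm index, all rotations of a term link to the original term. For example, the rotations of hello$ are hello$, ello$h, llo$he, lo$hel, o$hell and $hello, and each of them points to hello. All rotations of all terms (the permuterm vocabulary) are used to create the search tree. 58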

Searching with a permuterm index Example 1: a query m*n. Add the end symbol and rotate the query so that the * symbol appears at the end: m*n$ → n$m*. Then, a search tree is used to find the rotations starting with n$m. We can find some rotations such as n$ma (term: man) and n$moro (term: moron). 59

Searching with a permuterm index Example 2: a query fi*mo*er. Search the tree for all rotations starting with er$fi, which gives terms such as fishmonger and filibuster. Then, keep only the terms that also contain mo in the middle: fishmonger. 60

Permuterm indexes Advantage: can be used to answer all types of wildcard queries. Disadvantage: we need to store all rotations of each term in the dictionary, so the dictionary can become quite big. For English, this can increase the size of the dictionary by about 10 times. 61
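
A small sketch of a permuterm index for queries with a single wildcard. The function names are assumptions, and the linear scan over the rotations stands in for the prefix search that a real system would run on a search tree.

```python
def rotations(term):
    """All rotations of term + '$' (the end-of-term symbol)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    # Every rotation links back to its original term.
    index = {}
    for term in vocabulary:
        for rot in rotations(term):
            index[rot] = term
    return index

def one_wildcard(index, query):
    # Assumes exactly one '*': m*n -> m*n$ -> rotate to n$m*.
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix
    # Linear scan for simplicity; a search tree would do a prefix lookup.
    return sorted({term for rot, term in index.items() if rot.startswith(key)})

vocabulary = ["man", "moron", "name", "hello"]
index = build_permuterm(vocabulary)

print(rotations("man"))            # ['man$', 'an$m', 'n$ma', '$man']
print(one_wildcard(index, "m*n"))  # ['man', 'moron']
print(len(index))                  # 21 rotations for 4 terms
```

The last line shows the cost mentioned above: a term with n letters contributes n + 1 entries.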

k-gram indexes This is another type of index for answering general wildcard queries. k-gram: a sequence of k characters, e.g. the 3-grams of the word castle are $ca, cas, ast, stl, tle, le$, where $ marks the beginning and the end of the term. 62

k-gram index The dictionary of a k-gram index contains all k-grams that occur in any term of the vocabulary. Each k-gram points to the terms that contain it, e.g. cas → castle. 63

k-gram index: answering queries Answering a wildcard query, e.g. re*ve: 1) we search all terms containing $re using the k-gram index; 2) we search all terms containing ve$ using the k-gram index; 3) we compute the intersection of these terms, e.g. remove, relive, retrieve; 4) then, we use a standard dictionary to find the documents matching these terms. 64

A problem Query: red* If we use the previous approach on a 3-gram index, we will find some words such as retired, which contains both $re and red but does not match the query red*. Thus, for each term found, we still need to compare the term with the query to ensure that it really matches. 65
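
The sketch below puts the two steps together on a toy vocabulary: candidate terms come from the 3-gram index, and a final check removes false matches such as retired. All names are assumptions for the example; the final check uses Python's `fnmatch.fnmatchcase`, which is only appropriate here because the queries contain nothing but letters and `*`.

```python
import fnmatch
from collections import defaultdict

def kgrams(text, k=3):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_index(vocabulary, k=3):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def candidates(index, vocabulary, query, k=3):
    # k-grams of the pieces of $query$ between wildcards, e.g. red* -> $re, red.
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    result = set(vocabulary)
    for gram in grams:
        result &= index.get(gram, set())
    return sorted(result)

def wildcard(index, vocabulary, query, k=3):
    # Post-filtering: compare each candidate with the query itself.
    return [t for t in candidates(index, vocabulary, query, k)
            if fnmatch.fnmatchcase(t, query)]

vocabulary = ["castle", "red", "reduce", "relive", "remove", "retired", "retrieve"]
index = build_index(vocabulary)

print(candidates(index, vocabulary, "red*"))  # ['red', 'reduce', 'retired']
print(wildcard(index, vocabulary, "red*"))    # ['red', 'reduce']
print(wildcard(index, vocabulary, "re*ve"))   # ['relive', 'remove', 'retrieve']
```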

More complex queries Many search engines allow complex queries such as: re*d AND fe*ri. Those queries can be answered with the techniques that we have discussed: 1) find all documents matching re*d; 2) find all documents matching fe*ri; 3) compute the intersection of these documents. Such queries may be slow, as they require more processing. 66

SPELLING CORRECTION S*d*n*y 67

Spelling correction We will learn two techniques for dealing with spelling errors. e.g. carot instead of carrot 68

Two principles for spell correction 1. To correct a misspelled word, it is generally better to choose the nearest word (most similar word), e.g. carot → carrot rather than carotid. 2. If several correctly spelled words are equally similar to the misspelled word, then we should choose the most common word, e.g. grnt → grunt or grant? The most common word can be the most frequent in the documents, or the most frequently used in queries by other users. 69

How do search engines handle spelling errors? Several options for the query carot: retrieve documents containing carot as well as the corrected term carrot; retrieve documents containing carrot only if the term carot is not in the dictionary; retrieve documents containing carrot if the term carot returns few documents (fewer than a given number); show a suggested spelling to the user, and let the user choose: Did you mean carrot? 70

Forms of spelling correction Isolated-term correction: we attempt to correct a single query term at a time, e.g. carot → carrot. Context-sensitive correction: consider the whole query to try to fix spelling errors, e.g. flew form Heathrow → flew from Heathrow. 72

Edit distance ( 编辑距离 ) The edit distance between two terms s1 and s2 is the minimum number of edit operations needed to transform s1 into s2. Three operations: insert a character; delete a character; replace a character with another. 73

Example: editdistance(cat, dog) = 3; editdistance(cat, cat) = 0; editdistance(cat, car) = 1; editdistance(cat, cart) = 1; editdistance(cat, category) = 5. 74
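
These values can be checked with the standard dynamic programming computation of the edit distance (often called Levenshtein distance). The sketch below is one common way to write it; the function name `edit_distance` is chosen for the example, and every operation has cost 1.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and replacements
    needed to transform s1 into s2."""
    # prev[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # transforming s1[:i] into the empty string: i deletions
        for j, c2 in enumerate(s2, start=1):
            replace_cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,                 # delete c1
                curr[j - 1] + 1,             # insert c2
                prev[j - 1] + replace_cost,  # replace c1 with c2 (or keep it)
            ))
        prev = curr
    return prev[-1]

for a, b in [("cat", "dog"), ("cat", "cat"), ("cat", "car"),
             ("cat", "cart"), ("cat", "category")]:
    print(a, b, edit_distance(a, b))
# cat dog 3
# cat cat 0
# cat car 1
# cat cart 1
# cat category 5
```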

Spell-correction with edit distance To correct the spelling of a term (e.g. carot), we search for the terms that have the smallest edit distance to this term: editdistance(carot, carrot) = 1, editdistance(carot, carotid) = 2. But calculating this may be expensive (we don't want to compare the query term with every term of the dictionary). Solution? 75

Solution We can use some heuristics ( 启发式 ) Only search for words beginning with the same letter as the query term. An alternative: use multiple rotations of the query term using a permuterm index to search for terms similar to the query term, while omitting some letters (see book p. 60). 76

k-gram index for spell correction Using k-grams is another way of reducing the number of candidate terms for spelling correction. Consider a query term q. We retrieve all terms containing the k-grams in q. We keep those having the smallest edit distance. 77

Example query = bord. Using 2-grams, we find some terms similar to bord. Using the edit distance, we may find that border or lord are more likely than boardroom. We can also eliminate terms that are too different immediately (e.g. by comparing term lengths). 78
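
A sketch of this two-stage idea for the query bord. The vocabulary, the thresholds `min_shared` and `max_len_diff`, and the function names are all assumptions chosen for the illustration; the slides do not prescribe specific values.

```python
from collections import defaultdict

def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def build_bigram_index(vocabulary):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

def suggest(query, index, min_shared=2, max_len_diff=2):
    # Stage 1: cheap filtering with the 2-gram index and term lengths.
    shared = defaultdict(int)
    for gram in bigrams(query):
        for term in index.get(gram, ()):
            shared[term] += 1
    candidates = [t for t, n in shared.items()
                  if n >= min_shared and abs(len(t) - len(query)) <= max_len_diff]
    # Stage 2: rank the few remaining candidates by edit distance.
    return sorted(candidates, key=lambda t: (edit_distance(query, t), t))

vocabulary = ["aboard", "boardroom", "border", "carrot", "lord"]
index = build_bigram_index(vocabulary)
print(suggest("bord", index))  # ['lord', 'aboard', 'border']
```

Here boardroom shares two 2-grams with bord but is discarded by the length check before any edit distance is computed.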

Variations Some types of errors are more frequent than others. We can use weights to indicate that some operations are more likely than others, e.g. inserting a character may be less likely than replacing a character with another. 79

Context sensitive spelling correction Isolated-term correction may fail for some queries such as flew form Heathrow → flew from Heathrow, because form is itself a correctly spelled word. A simple approach to consider the context: 1) Even if the words are spelled correctly, apply spell correction to each term. 2) Generate all combinations of corrected terms to create new queries. 3) Execute all these queries on the search engine. 4) Return the results for the query that has the largest number of results. This method can be time-consuming! 80
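
The steps above can be sketched as follows. Everything in this example is hypothetical: the alternative spellings, the `hit_count` function, and in particular the hit numbers, which are invented only to make the example runnable and do not come from any real search engine.

```python
from itertools import product

def best_query(query_terms, alternatives, hit_count):
    # Steps 1-2: each term may keep its spelling or take one of its
    # alternatives; generate every combination.
    options = [[term] + alternatives.get(term, []) for term in query_terms]
    candidates = [" ".join(combo) for combo in product(*options)]
    # Steps 3-4: "run" every candidate query and keep the one with most results.
    return max(candidates, key=hit_count), candidates

# Hypothetical alternative spellings for each query term.
alternatives = {"flew": ["flea"], "form": ["from", "fore"]}

# Invented result counts, standing in for a real search engine.
fake_hits = {
    "flew from heathrow": 120,
    "flew form heathrow": 3,
    "flea from heathrow": 1,
}

def hit_count(query):
    return fake_hits.get(query, 0)

best, tried = best_query(["flew", "form", "heathrow"], alternatives, hit_count)
print(len(tried))  # 6 candidate queries
print(best)        # flew from heathrow
```

Even this tiny query produces 6 candidates; the number grows multiplicatively with the number of terms and alternatives, which is why the method can be time-consuming.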

Alternatives We may use heuristics to reduce the number of possibilities. A heuristic: consider only the most frequent combinations of query terms according to previous queries from other users, e.g. we keep flew from but not flea from or flew fore. 81

Conclusion Today, we have discussed in more detail how to search in dictionaries. We discussed wildcard queries. We discussed spell correction. The PPT slides are on the website. QQ group: 738927894 82

References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008 83