信息检索与搜索引擎 Introduction to Information Retrieval GESC1007


Philippe Fournier-Viger, Full professor
School of Natural Sciences and Humanities
philfv8@yahoo.com
Spring 2019

Last week
We discussed in more detail how indexes are created: tokenization, normalization, and lemmatization.
We also covered phrase queries using positional indexes.
QQ Group: 738927894
Website: PPTs

Course schedule (日程安排), Lectures 1 to 7
Topics, in order:
Introduction
Boolean retrieval (布尔检索模型)
Term vocabulary and posting lists
Dictionaries and tolerant retrieval
Index construction and compression
Scoring, weighting, and the vector space model
Computing scores in a complete search system
Evaluation in information retrieval
Web search engines, advanced topics, and conclusion

About last course
Normalization (规范化): the process of converting tokens to a standard form.
Stemming: removing the end of words (simple). cars, airplanes -> car, airplane
Lemmatization: converting a word to a common base form called a lemma (more complex). am, are, is -> be

CHAPTER 3: DICTIONARIES AND TOLERANT RETRIEVAL (PDF p.86-)

Previous weeks
Boolean retrieval model (布尔检索模型), using Boolean operators: Shenzhen AND food
Phrase (短语) queries: Airplane tickets from Beijing
Proximity queries: Shenzhen (within 5 words) of City
To find documents, we have used a dictionary (词典), which together with its postings lists forms the inverted index (倒排索引).

Today
How to deal with typographical errors (打字错误)? Shenzhen vs Shenzhennn. These are often made by accident (无意地).
How to deal with different spellings (拼法)? color vs colour, analyze vs analyse.
How to deal with phonetically similar terms (发音相似的词)? concede vs conceed, right vs write vs rite vs wright.

Wildcard queries (通配符查询)
Wildcard query: a query containing the wildcard (通配符) character *.
* = any sequence of zero or more characters, e.g. automat* to search for automated, automation, automata.
When should we use wildcard queries?
When we want documents containing variants of a query term.
When we are uncertain about how to spell a query term, e.g. Sydney vs Sidney.

Searching for documents
Given a set of documents, an inverted index (dictionary 词典) and a query (查询), we can search for documents.
Searching takes several steps.


How does an IR system answer Boolean queries?
Dictionary and postings:
China: Book1, Book20, Book34
City: Book1, Book2, Book7, Book20, ...
Located: ...
Shenzhen: Book1, Book3, Book5, Book9, ...
The query is: CITY AND CHINA
1) Locate CITY in the dictionary
2) Retrieve its postings
3) Locate CHINA in the dictionary
4) Retrieve its postings
5) Compute the intersection (交集) of the two lists
RESULT: Book1, Book20
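The five steps can be sketched in Python as follows. This is a minimal illustration, not the way a production engine stores its index: the `index` dictionary, the function names, and the use of integers for Book1, Book2, ... are assumptions made for this example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Toy index taken from the slide; 1 stands for Book1, 20 for Book20, etc.
index = {
    "china": [1, 20, 34],
    "city": [1, 2, 7, 20],
}

def boolean_and(term1, term2):
    # Steps 1-4: locate each term and retrieve its postings.
    p1 = index.get(term1.lower(), [])
    p2 = index.get(term2.lower(), [])
    # Step 5: intersect the two lists.
    return intersect(p1, p2)

print(boolean_and("CITY", "CHINA"))  # [1, 20]
```

Because both lists are sorted, the intersection walks each list only once.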

How to quickly search for terms in a dictionary?
There are different approaches. Choosing one depends on:
the number of terms in the dictionary (few or many?);
whether the terms are static or dynamic (are new terms added? are some terms removed?);
the relative frequencies (相对频率) with which terms are accessed (are some terms much more popular than others?).

Approach 1: Hashing (散列)
Basic idea: a hash function (散列函数) associates a non-negative integer with each term of the dictionary.
Example: h(shenzhen) = 1246

Approach 1: Hashing (散列)
Example: we can define the hash function as the number of letters in a word.
h(term) = number of letters
h(china) = 5
h(shenzhen) = 8
h(city) = 4
h(located) = 7
These numbers are called "hash values" (散列值).

Approach 1: Hashing (散列)
The dictionary is built so that each term is stored under its hash value. When searching, the hash function is used to quickly find the terms of the query.
Dictionary:
4  City      Book1, Book2, Book10, ..., Book20
5  China     Book1, Book20, ...
7  Located
8  Shenzhen  Book1, Book3


Approach 1: Hashing (散列)
Query: City AND Shenzhen
h(city) = 4
h(shenzhen) = 8
Dictionary:
4  City      Book1, Book2, Book10, ..., Book20
5  China     Book1, Book20, ...
7  Located
8  Shenzhen  Book1, Book3
Result: Book1

Advantage of Hashing (散列)
Using a hash function (散列函数) makes searching in a dictionary very fast.
By calculating the value of the hash function, we can directly find where a term is located in the dictionary.

Problem of Hashing (散列)
However, many terms may have the same hash value (this is a collision 冲突). When that happens, several terms share one slot and the lookup is still slow.
In our example, most English words have fewer than 17 letters, so the hash function produces only a few distinct values and there will be many collisions.

Dictionary (collision illustration): City, Maze, Quiz and Jury all have four letters, so they all receive the hash value 4.

Problem of Hashing (散列)
We could reduce that problem by using a better hash function (散列函数).
h(term) = sum of the letters when converted to numbers (a = 1, b = 2, ..., z = 26)
h(city) = c + i + t + y = 3 + 9 + 20 + 25 = 57
This works better because terms are less likely to have the same number.
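The two hash functions from the slides can be compared with a short Python sketch. The function names and the `buckets` helper are made up for this illustration, and the letter-sum function assumes lowercase ASCII letters only.

```python
from collections import defaultdict

def h_length(term):
    """First hash function from the slides: the number of letters."""
    return len(term)

def h_letter_sum(term):
    """Second hash function: sum of letter positions (a=1, ..., z=26).
    Assumes the term contains only ASCII letters."""
    return sum(ord(c) - ord("a") + 1 for c in term.lower())

def buckets(terms, h):
    """Group terms by hash value; a bucket holding several terms is a collision."""
    table = defaultdict(list)
    for t in terms:
        table[h(t)].append(t)
    return dict(table)

terms = ["city", "maze", "quiz", "jury", "china"]
by_length = buckets(terms, h_length)
by_sum = buckets(terms, h_letter_sum)

print(by_length)  # the four-letter words all fall into bucket 4
print(by_sum)     # each of these five terms gets its own bucket
```

The letter-sum function is better but still not collision-free: any two anagrams, such as act and cat, receive the same value.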

Problem of Hashing (散列)
There is no simple way of finding variants of the same query term: resume vs résumé. Those two words may not have the same hash value.
We cannot answer wildcard queries such as automat* to search for automated, automation.

Approach 2: Search tree (搜索树)
Basic idea: to search quickly, the terms are inserted in a tree, and the tree is then used to look up the terms.

Illustration
Root (根节点)
  a-m
    a-h: city
    h-m: located
  n-z
    n-r
    s-z: shenzhen
The nodes a-m, n-z, a-h, h-m, n-r and s-z are internal nodes (内部节点).

Description of a search tree
A search tree is a tree where each node can have several child nodes.
To search for a term, we start from the root (根节点) of the tree.
Each internal node (内部节点) in the tree has a test to decide which child node should be explored.
The search ends when the term is found.

Searching CITY
The search always starts from the root (根节点) of the tree.
Root -> a-m -> a-h -> city

Searching Shenzhen
The search always starts from the root (根节点) of the tree.
Root -> n-z -> s-z -> shenzhen
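The two walks through the tree can be imitated with a small Python sketch. The nested-list layout, the `search` function, and the rule of testing only the first letter are assumptions chosen to mirror the figure; a real binary tree or B-tree is organized differently. Terms are assumed to be non-empty and lowercase.

```python
# An internal node is a list of (low, high, child) entries.
# A leaf is a set of terms. The ranges are copied from the figure.
tree = [
    ("a", "m", [("a", "h", {"city"}), ("h", "m", {"located"})]),
    ("n", "z", [("n", "r", set()), ("s", "z", {"shenzhen"})]),
]

def search(node, term, path=()):
    """Return (found, path), where path lists the internal nodes visited."""
    if isinstance(node, set):              # reached a leaf
        return term in node, path
    first = term[0]
    for low, high, child in node:          # the test at an internal node
        if low <= first <= high:           # first matching range wins
            return search(child, term, path + (f"{low}-{high}",))
    return False, path

print(search(tree, "city"))      # (True, ('a-m', 'a-h'))
print(search(tree, "shenzhen"))  # (True, ('n-z', 's-z'))
```

Each lookup visits one node per level, so the cost grows with the height of the tree rather than with the number of terms.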

Approach 2: Search tree (搜索树)
Advantages:
A search tree (搜索树) makes it possible to quickly find terms in a dictionary to answer a query.
It also makes it possible to find all terms that match a prefix (前缀), e.g. automat* (a type of wildcard query).

Searching automat*
The search always starts from the root (根节点) of the tree.
Root -> a-m -> a-h -> automated, automation

Technical details
There are many types of search trees:
binary tree (二叉树): a tree where each node has no more than two children.
B-tree (B 树): a type of tree where all branches are equally long.
B+ tree (B+ 树)
We will not discuss these details.

How to apply this to Chinese?
In English, there is a standard order between letters: A, B, C, ..., X, Y, Z.
In Chinese, there is no single standard ordering of the characters. Dictionaries may be organized semantically, phonetically (pinyin), by number of strokes, etc.

When to use wildcard queries?
When the user is uncertain of the spelling of a term: S*dney for Sydney or Sidney.
When the user wants to find spelling variations of the same word: col*r for color or colour.
When the user wants to find variants of a term: judic* for judicial or judiciary.
When the user wants to find a word that may be written differently in another language: Universit* of Stuttgart for University, Université, Universität.

Trailing wildcard queries
Trailing wildcard query: the * symbol appears at the end of a term, e.g. automat*, judic*.
These queries can be easily handled using a search tree over the dictionary.
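A trailing wildcard query amounts to finding all terms that share a prefix. In the sketch below, a sorted Python list searched with the standard `bisect` module stands in for the search tree; that substitution, the function name, and the sample vocabulary are assumptions of this example. The upper bound `prefix + "\uffff"` assumes no term contains characters at or above U+FFFF.

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """Return all terms that start with prefix, using binary search."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

vocabulary = sorted([
    "automata", "automated", "automation", "autumn",
    "city", "judicial", "judiciary", "located", "shenzhen",
])

print(prefix_matches(vocabulary, "automat"))  # query automat*
print(prefix_matches(vocabulary, "judic"))    # query judic*
```

All matching terms sit next to each other in sorted order, which is the same property that lets a search tree answer prefix queries quickly.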

Leading wildcard queries
Leading wildcard query: the * symbol appears at the beginning of a term, e.g. *mobile for automobile, mobile, immobile.
How to handle these queries?
Solution: use a reverse search tree, where the terms are read backward.
Thus there are two trees: one for trailing wildcard queries and one for leading wildcard queries.

Reverse search tree: searching CITY
In the reverse tree, each term is filed under its last letter: located under a-h (d), shenzhen under n-s (n), city under t-z (y).
The search always starts from the root of the tree.
Root -> n-z -> t-z -> city
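The idea of reading terms backward can be sketched as follows. As before, a sorted list with `bisect` stands in for the reverse search tree; the function name and the sample terms are assumptions of this example, and a real system would build the reversed structure once rather than on every query.

```python
import bisect

def suffix_matches(terms, suffix):
    """Return all terms that end with suffix, via a sorted list of reversed terms."""
    reversed_terms = sorted(t[::-1] for t in terms)   # the "reverse tree"
    key = suffix[::-1]                                # *mobile -> elibom*
    lo = bisect.bisect_left(reversed_terms, key)
    hi = bisect.bisect_left(reversed_terms, key + "\uffff")
    # Reverse the hits again to recover the original terms.
    return sorted(r[::-1] for r in reversed_terms[lo:hi])

terms = ["automobile", "mobile", "immobile", "city", "kidney"]
print(suffix_matches(terms, "mobile"))  # query *mobile
```

A leading wildcard on the original term becomes a trailing wildcard on the reversed term, so the same prefix search applies.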

Other wildcard queries?
But what if the wildcard * is not at the end or beginning of a term, as in S*dney?
We would like to handle queries where the * symbol can appear anywhere in a term.

Queries with one wildcard (*)
Using a search tree and a reverse search tree, an IR system can answer any query containing one wildcard (*).
How? Example: S*dney
1) Use the search tree to find all terms starting with S.
2) Use the reverse search tree to find all terms ending with dney.
3) Calculate the intersection of the two sets of terms.
4) Find the documents corresponding to these terms in the dictionary as usual.

Words that start with S: Sidney, Shanghai, Shenzhen
Words that end with dney: Kidney, Sidney
Words that match S*dney: Sidney
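The intersection step can be sketched in Python. Here the two sets are built by scanning the vocabulary with `startswith` and `endswith`, which is a simplification standing in for the two tree lookups; the function name and the sample terms are assumptions. The final length check is an addition of this sketch: it rejects terms where the prefix and suffix would overlap.

```python
def match_single_wildcard(terms, query):
    """Answer a query containing exactly one '*', e.g. 's*dney'."""
    prefix, suffix = query.split("*")
    starts = {t for t in terms if t.startswith(prefix)}  # search tree
    ends = {t for t in terms if t.endswith(suffix)}      # reverse search tree
    candidates = starts & ends
    # Prefix and suffix must not overlap: for ab*ba, "aba" is not a match.
    return sorted(t for t in candidates if len(t) >= len(prefix) + len(suffix))

terms = ["sidney", "sydney", "shanghai", "shenzhen", "kidney"]
print(match_single_wildcard(terms, "s*dney"))  # ['sidney', 'sydney']
```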

General wildcard queries
General wildcard query: a query containing one or more wildcards (*), e.g. transf*mat*, *an*.
How to answer these queries? There are two techniques.

Permuterm indexes
The permuterm index is a special form of inverted index built for wildcard queries.
A special symbol $ is used to indicate the end of each term: hello$, Shenzhen$, Beijing$.

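In a permuterm index, all rotations of a term link to the original term.
For example, the rotations of hello$ are: hello$, ello$h, llo$he, lo$hel, o$hell, $hello.
Together, the rotations of all terms form the permuterm vocabulary, which is used to build the search tree.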

Searching with a permuterm index
Example 1: the query m*n
Rotate the query so that the * symbol appears at the end: m*n$ -> n$m*
Then, the search tree is used to find the rotations that begin with n$m.
We can find terms such as:
n$ma -> man
n$moro -> moron

Searching with a permuterm index
Example 2: the query fi*mo*er
Search the tree for all terms matching er$fi*: fishmonger, filibuster.
Then, keep only the terms that contain mo in the middle: fishmonger.
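Both examples can be reproduced with a small Python sketch. A plain dictionary scanned with `startswith` stands in for the search tree over rotations, and the standard `fnmatch` module does the final filtering; these choices and all function names are assumptions of this example. The query is assumed to contain at least one * and no other pattern characters.

```python
import fnmatch

def rotations(term):
    """All rotations of term + '$', e.g. man$ -> an$m, n$ma, $man."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Map every rotation to the original term it came from."""
    return {rot: t for t in terms for rot in rotations(t)}

def wildcard_lookup(index, query):
    """Rotate the query so a '*' is last, look up the key, then filter."""
    first, *_, last = query.split("*")
    key = last + "$" + first                 # m*n -> n$m, fi*mo*er -> er$fi
    candidates = {t for rot, t in index.items() if rot.startswith(key)}
    # For several wildcards, drop candidates that do not fit the full query.
    return sorted(t for t in candidates if fnmatch.fnmatchcase(t, query))

index = build_permuterm(["man", "moron", "moon", "name", "fishmonger", "filibuster"])
print(wildcard_lookup(index, "m*n"))       # ['man', 'moon', 'moron']
print(wildcard_lookup(index, "fi*mo*er"))  # ['fishmonger']
```

For fi*mo*er the key er$fi retrieves both fishmonger and filibuster, and the filter keeps only the one containing mo.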

Permuterm indexes
Advantage: they can be used to answer all types of wildcard queries.
Disadvantage: we need to store all rotations of each term in the dictionary, so the dictionary can be quite big. For English, this can increase the size of the dictionary by about 10 times.

k-gram indexes
This is another type of index for answering general wildcard queries.
k-gram: a sequence of k characters.
e.g. the 3-grams of the word castle: $ca, cas, ast, stl, tle, le$

k-gram index
The dictionary of a k-gram index contains all k-grams that occur in any term of the vocabulary. Each k-gram points to the terms that contain it, e.g. cas -> castle.

k-gram index: answering queries
Answering a wildcard query, e.g. re*ve:
1) We search for all terms containing $re using the k-gram index.
2) We search for all terms containing ve$ using the k-gram index.
3) We compute the intersection of these terms: remove, relive, retrieve.
4) Then, we use a standard dictionary to find the documents matching these terms.

A problem
Query: red*
If we use the previous approach on a 3-gram index, we will find words such as retired, which contains both $re and red but does not match the query red*.
Thus, for each term found, we still need to compare the term with the query to ensure that it really matches.
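The lookup and the extra filtering step can be sketched together in Python. The index layout (a dictionary from k-gram to a set of terms), the function names, and the use of the standard `fnmatch` module for the final check are assumptions of this example.

```python
from collections import defaultdict
import fnmatch

def kgrams(text, k=3):
    """The set of all k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_kgram_index(terms, k=3):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for t in terms:
        for g in kgrams("$" + t + "$", k):
            index[g].add(t)
    return index

def wildcard_candidates(index, query, k=3):
    """Intersect the term sets of every k-gram in the query's fixed parts."""
    parts = ("$" + query + "$").split("*")
    grams = set().union(*(kgrams(p, k) for p in parts))
    if not grams:                      # query too short to yield any k-gram
        return set().union(*index.values())
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_match(index, query, k=3):
    """Post-filter the candidates against the query itself."""
    return sorted(t for t in wildcard_candidates(index, query, k)
                  if fnmatch.fnmatchcase(t, query))

index = build_kgram_index(["red", "reduce", "retired", "remove", "relive", "retrieve"])
print(sorted(wildcard_candidates(index, "red*")))  # includes 'retired'
print(wildcard_match(index, "red*"))               # ['red', 'reduce']
```

The candidate set for red* still contains retired; only the comparison with the query removes it.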

More complex queries
Many search engines allow complex queries such as: re*d AND fe*ri
Those queries can be answered with the techniques that we have discussed:
1) Find all documents matching re*d.
2) Find all documents matching fe*ri.
3) Find the intersection of these documents.
Such queries may be slow, as they require more processing.

SPELLING CORRECTION
S*d*n*y

Spelling correction
We will learn two techniques for dealing with spelling errors, e.g. carot instead of carrot.

Two principles for spelling correction
1. To correct a misspelled word, it is generally better to choose the nearest word (the most similar word), e.g. for carot, prefer carrot to carotid.
2. If several correctly spelled words are equally similar to the misspelled word, we should choose the most common word, e.g. for grnt: grunt or grant? The most common word can be the most frequent one in a text, or the one most frequently used in queries by other users.

How do search engines handle spelling errors?
For the query carot, a search engine may:
1) retrieve documents containing carot as well as the corrected term carrot;
2) retrieve documents containing carrot only if the term carot is not in the dictionary;
3) retrieve documents containing carrot only if the term carot returns few documents (fewer than a given number);
4) show a suggested spelling to the user and let the user choose: Did you mean carrot?


Forms of spelling correction
Isolated-term correction: we attempt to correct a single query term. carot -> carrot
Context-sensitive correction: we consider the whole query to try to fix spelling errors. flew form Heathrow -> flew from Heathrow

Edit distance (编辑距离)
The edit distance between two terms s1 and s2 is the minimum number of edit operations needed to transform s1 into s2.
Three operations:
insert a character
delete a character
replace a character with another

Example
editdistance(cat, dog) = 3
editdistance(cat, cat) = 0
editdistance(cat, car) = 1
editdistance(cat, cart) = 1
editdistance(cat, category) = 5
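These values can be checked with the classic dynamic-programming computation of edit distance (often called Levenshtein distance). The sketch below is one common way to write it; the function name follows the slide, and each of the three operations is assumed to cost 1.

```python
def editdistance(s1, s2):
    """Minimum number of insertions, deletions and replacements
    needed to turn s1 into s2."""
    # prev[j] = distance between the first i-1 letters of s1 and the first j of s2
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                                  # delete all i letters of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,            # delete c1
                            curr[j - 1] + 1,        # insert c2
                            prev[j - 1] + cost))    # replace c1 by c2, or keep it
        prev = curr
    return prev[-1]

for a, b in [("cat", "dog"), ("cat", "cat"), ("cat", "car"),
             ("cat", "cart"), ("cat", "category")]:
    print(a, b, editdistance(a, b))
```

The running time is proportional to the product of the two term lengths, which is why comparing a query term with every dictionary term is expensive.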

Spelling correction with edit distance
To correct the spelling of a term (e.g. carot), we search for the terms that have the smallest edit distance to this term.
editdistance(carot, carrot) = 1
editdistance(carot, carotid) = 2
But calculating this may be expensive (we don't want to compare the query term with every term in the dictionary).
Solution?

Solution
We can use some heuristics (启发式).
Only search for words beginning with the same letter as the query term.
An alternative: use a permuterm index and search with multiple rotations of the query term, while omitting some letters, to find terms similar to the query term (see book p. 60).

k-gram index for spelling correction
Using k-grams is another way of reducing the number of candidate terms for spelling correction.
Consider a query term q.
We retrieve all terms containing the k-grams in q.
We keep those having the smallest edit distance.

Example
query = bord
Using 2-grams, we find some terms similar to bord, such as border, lord and boardroom.
Using the edit distance, we may find that border or lord are more likely than boardroom.
We can eliminate terms that are too different immediately (e.g. by comparing term lengths).
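The bord example can be sketched in Python. The threshold `min_shared` (how many 2-grams a candidate must share with the query), the function names, and the small vocabulary are assumptions of this example; the slides do not fix a threshold. The 2-grams are taken without the $ boundary marker.

```python
from functools import lru_cache

def bigrams(term):
    """The set of 2-grams of a term, e.g. bord -> {bo, or, rd}."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def edit_distance(a, b):
    """Edit distance with unit costs, written as a memoized recursion."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,                           # delete
                   d(i, j - 1) + 1,                           # insert
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # replace or keep
    return d(len(a), len(b))

def suggest(query, vocabulary, min_shared=2):
    """Keep terms sharing enough 2-grams with the query, ranked by edit distance."""
    q = bigrams(query)
    candidates = [t for t in vocabulary if len(q & bigrams(t)) >= min_shared]
    return sorted(((t, edit_distance(query, t)) for t in candidates),
                  key=lambda pair: (pair[1], pair[0]))

vocabulary = ["border", "lord", "boardroom", "city"]
print(suggest("bord", vocabulary))
# [('lord', 1), ('border', 2), ('boardroom', 5)]
```

The 2-gram filter discards city without computing any edit distance, and the ranking puts lord and border ahead of boardroom.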

Variations
Some types of errors are more frequent than others.
We can use weights to indicate that some operations are more likely than others, e.g. inserting a character may be less likely than replacing a character with another.
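One way to express such weights is to give each operation its own cost in the edit-distance computation, so that a less likely operation costs more. The parameter names and the particular cost values below are hypothetical and chosen only for illustration.

```python
def weighted_edit_distance(s1, s2, ins=1.0, dele=1.0, rep=1.0):
    """Edit distance where each operation has its own cost.
    ins, dele and rep are illustrative parameters, not values from the slides."""
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * dele]
        for j, c2 in enumerate(s2, start=1):
            keep_or_replace = prev[j - 1] + (0.0 if c1 == c2 else rep)
            curr.append(min(prev[j] + dele,       # delete c1
                            curr[j - 1] + ins,    # insert c2
                            keep_or_replace))
        prev = curr
    return prev[-1]

# Replacing is assumed cheaper (more likely) than inserting.
print(weighted_edit_distance("cat", "car", ins=1.5, rep=0.5))   # 0.5
print(weighted_edit_distance("cat", "cart", ins=1.5, rep=0.5))  # 1.5
```

With these costs, car is considered closer to cat than cart is, although both are at distance 1 with unit costs.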

Context-sensitive spelling correction
Isolated-term correction may fail for some queries, such as: flew form Heathrow -> flew from Heathrow
A simple approach that considers the context:
1) Apply spelling correction to every word, even if it is spelled correctly.
2) Generate all combinations of corrected terms to create new queries.
3) Execute all these queries on the search engine.
4) Return the results for the query that has the largest number of results.
This method can be time-consuming!
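The simple approach can be sketched as follows. The `candidates` table and the `hits` table (which stands in for running each query on a search engine and counting results) contain made-up values for illustration only.

```python
from itertools import product

# Hypothetical spelling candidates for each query word (step 1).
candidates = {
    "flew": ["flew", "flea"],
    "form": ["form", "from", "fore"],
    "heathrow": ["heathrow"],
}

# Hypothetical numbers of results; a real system would query the engine (step 3).
hits = {
    ("flew", "from", "heathrow"): 120,
    ("flew", "form", "heathrow"): 3,
    ("flea", "from", "heathrow"): 1,
}

def best_correction(query):
    words = query.lower().split()
    # Step 2: all combinations of candidate corrections.
    options = product(*(candidates.get(w, [w]) for w in words))
    # Step 4: keep the combination with the most results.
    return " ".join(max(options, key=lambda q: hits.get(q, 0)))

print(best_correction("flew form Heathrow"))  # flew from heathrow
```

The number of combinations is the product of the candidate counts per word (here 2 x 3 x 1 = 6), which is why the method becomes slow for long queries.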

Alternatives
We may use heuristics to reduce the number of possibilities.
A heuristic: consider only the most frequent combinations of query terms, according to previous queries from other users. We keep flew from, but not flea from or flew fore.

Conclusion
Today, we discussed in more detail how to search in dictionaries.
We discussed wildcard queries.
We discussed spelling correction.
The PPT slides are on the website.
QQ group: 738927894

References
Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.