信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

1 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities Spring

2 Last week We discussed: hashing ( 散列 ) and search trees ( 搜索树 ), wildcard queries, and spelling correction. QQ Group: Website: PPTs

3 Course schedule ( 日程安排 ) Lectures 1 to 7. Topics, in order: Introduction; Boolean retrieval ( 布尔检索模型 ); Term vocabulary and posting lists; Dictionaries and tolerant retrieval; Index construction and compression; Scoring, weighting, and the vector space model; Computing scores, and a complete search system; Evaluation in information retrieval; Web search engines, advanced topics, and conclusion.

4 PHONETIC ( 语音的 ) CORRECTION Example of words that sound alike: Write, Right, Rite, Wright.

5 Phonetic correction Misspellings are often caused by a user typing a query that sounds like the target term. Phonetic hashing groups together all terms that sound similar, so that similar-sounding terms get the same hash value.

6 Soundex algorithms
1. Turn every term to be indexed into a 4-character reduced form, e.g. Hermann → H655.
2. Use these reduced forms to create an inverted index (dictionary 词典 ) from reduced forms to the original terms. This dictionary is called the soundex index.
3. Do the same with query terms.
4. When a new query arrives, search using the soundex index.

7 How to calculate the 4-character codes?
1. Retain the first letter of the term.
2. Change all occurrences of the following letters to 0 (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows: B, F, P, V to 1. C, G, J, K, Q, S, X, Z to 2. D, T to 3. L to 4. M, N to 5. R to 6.
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting text. Pad the resulting text with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
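The five steps above can be sketched in Python as follows. This is a minimal sketch that follows the steps of this slide literally (other Soundex variants differ in small details); the function name `soundex` and the decision to ignore non-letter characters are assumptions made for the example.

```python
def soundex(term):
    """Return the 4-character code of a term, following the five steps above."""
    groups = [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
              ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
    code_of = {letter: digit for letters, digit in groups for letter in letters}

    # Keep only the letters A-Z (assumption: other characters are ignored).
    letters = [c for c in term.upper() if c in code_of]
    if not letters:
        return ""

    # Step 1: retain the first letter.
    first = letters[0]
    # Steps 2 and 3: replace the remaining letters by digits.
    digits = [code_of[c] for c in letters[1:]]
    # Step 4: keep only one digit out of each run of identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: remove zeros, pad with trailing zeros, keep four positions.
    kept = [d for d in collapsed if d != "0"]
    return (first + "".join(kept) + "000")[:4]


print(soundex("Hermann"))  # H655
print(soundex("Herman"))   # H655, so both spellings fall in the same group
```

Because "Hermann" and "Herman" get the same code, a query for one spelling can be matched to the other through the soundex index.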

8 Observation about Soundex Vowels ( 元音 ) and vowel-like letters (A, E, I, O, U, H, W, Y) are viewed as interchangeable when transcribing names. Consonants ( 辅音 ) with similar sounds are considered to be the same, e.g. D and T. These rules work reasonably well for many European languages.

9 CHAPTER 4 INDEX CONSTRUCTION PDF p.104 9

10 Introduction We will talk about how to construct an inverted index. This process is called index construction or indexing ( 索引 ). It is performed by software called an indexer ( 索引器 ). (Figure: a collection of documents is given to the indexer, which produces an inverted index used to search for documents.)

11 Introduction For a Web search engine like Baidu or Bing, the documents to index are collected by a spider or web crawler ( 网络爬虫 ). A web crawler is a program that browses the Web periodically so that the index of webpages can be kept up to date.

12 Types of IR systems There are small-scale information retrieval systems (e.g. to search documents in a company) and large-scale information retrieval systems (e.g. to search the Web). In general, we want an IR system to be fast. Thus, the characteristics of the computer hardware ( 计算机硬件 ) must be considered.

13 Computer memory There are two main types of memory in a computer. Hard drive ( 硬盘驱动器 ): permanent storage, cheaper, slower. RAM ( RAM 芯片 ): temporary storage, more expensive, faster.

14 About hardware 1) Data access time ( 访问时间 ) Accessing data in RAM is faster than accessing data on a hard drive. To increase the speed of an IR system, we should keep as much data as possible in RAM. We may use a computer with several gigabytes (GB) of RAM for an IR system. A technique called caching ( 缓存 ) consists of keeping the most frequently accessed data in RAM.

15 About hardware 2) How the data is organized is important How the data is organized in memory also influences how fast the data can be read or written. In general, if the data that we read is stored contiguously ( 连续的 ) on the hard drive, then reading it is faster than if it is not stored contiguously. (Figure: data stored contiguously vs. data not stored contiguously.)

16 About hardware 3) Data compression ( 数据压缩 ) can reduce the time for reading data from the hard drive Data compression refers to techniques for reducing the size of the data. If the data is smaller, reading it is faster. (Figure: uncompressed data vs. compressed data.)

17 Simple approach for index construction Step 1. Each document from the collection is read. For each word, a <term, document ID> pair is created. e.g. the pair <Brutus, 1> indicates that the term Brutus appears in document #1, and the pair <Brutus, 2> indicates that the term Brutus also appears in document #2.

19 Step 2. All the pairs are sorted alphabetically by term. Thus, all pairs representing the same term now appear consecutively.

20 Step 3. The pairs with the same term are then combined to create the inverted index (dictionary).

21 Step 3. The pairs are then used to create the inverted index (dictionary). Each entry contains: the term (e.g. Brutus); the frequency of this term, which is optional (e.g. Brutus appears in 2 documents); and the posting list (e.g. Brutus appears in documents 1 and 2).
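Steps 1 to 3 can be sketched in Python as follows. This is a minimal in-memory sketch, not a complete indexer: the function name `build_index`, the three tiny documents, and the choice to lowercase and split on spaces are assumptions made for the example.

```python
def build_index(docs):
    """docs maps a document ID to its text; returns term -> posting list."""
    # Step 1: read each document and create <term, document ID> pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))

    # Step 2: sort the pairs, so pairs with the same term become consecutive.
    pairs.sort()

    # Step 3: combine pairs with the same term into one posting list.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip repeated pairs
            postings.append(doc_id)
    return index


# Hypothetical collection used only for illustration.
docs = {1: "brutus killed caesar",
        2: "caesar praised brutus",
        3: "calpurnia saw caesar"}
index = build_index(docs)
print(index["brutus"])       # posting list: [1, 2]
print(len(index["brutus"]))  # frequency of the term: 2 documents
```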

22 Example Reuters-RCV1: a collection of about 800,000 news documents published between August 20, 1996 and August 19, 1997. About 1 GB of text. On average, 200 tokens per document. About 400,000 terms.

23 Example (cont'd) The collection contains about 100 million tokens. If each <term, document ID> pair is stored as two 32-bit (4-byte) numbers, the 100 million pairs take about 0.8 GB. This can fit in the memory of a desktop computer. However, for larger document collections, this is not possible.

24 Index construction If a computer does not have enough RAM, the index must be created on the hard drive. At any given moment, only part of the data can be stored in RAM. Thus, the list of <term, document ID> pairs must be stored on the hard drive. It must also be sorted on the hard drive. It is not easy to write a software ( 软件 ) program that does this. This is an advanced topic. For more details, see p.71 of the book.

25 Several variations of indexing There are several other approaches for indexing. Another one:
1. An empty dictionary is created in RAM.
2. Documents are read one by one to fill the dictionary.
3. If the memory is full, the current dictionary is saved to disk and a new dictionary is created in memory.
4. The process continues to fill the new dictionary.
5. Finally, all the dictionaries need to be merged to obtain a single dictionary.
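The five steps above can be sketched in Python as follows. This is only a sketch under simplifying assumptions: the list `blocks` stands in for dictionaries saved to disk, "memory is full" is simulated by a hypothetical limit `max_postings_in_memory`, and documents are assumed to arrive in increasing document ID order.

```python
def index_in_blocks(docs, max_postings_in_memory):
    """docs is a list of (document ID, text) pairs in increasing ID order."""
    blocks = []   # stands in for the dictionaries saved to disk
    current = {}  # step 1: an empty dictionary in memory
    count = 0
    for doc_id, text in docs:
        # Step 2: read documents one by one to fill the dictionary.
        for term in set(text.lower().split()):
            current.setdefault(term, []).append(doc_id)
            count += 1
        # Step 3: when memory is "full", save the dictionary and start anew.
        if count >= max_postings_in_memory:
            blocks.append(current)
            current, count = {}, 0  # step 4: continue with a new dictionary
    if current:
        blocks.append(current)

    # Step 5: merge all the dictionaries into a single dictionary.
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    return merged, len(blocks)


docs = [(1, "brutus killed caesar"),
        (2, "caesar praised brutus"),
        (3, "calpurnia saw caesar")]
merged, number_of_blocks = index_in_blocks(docs, max_postings_in_memory=4)
print(number_of_blocks)  # 2
print(merged["caesar"])  # [1, 2, 3]
```

A real implementation would write each block to a file and merge the files by reading them in sorted term order, so that the merged dictionary never has to fit in memory at once.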

26 Distributed indexing 分布式索引 Up to now, we have discussed indexing on a single computer. For large document collections (e.g. the World Wide Web), indexing cannot be done efficiently using a single computer. Solution: create a distributed index ( 分布式索引 ). It is an index that is stored on many computers.

27 Distributed indexing 分布式索引 A distributed index is split across several computers, either according to terms or according to documents. Here we will discuss indexes where the data is organized according to terms rather than documents.
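Organizing a distributed index according to terms can be sketched as follows. This is an illustrative sketch only: real systems choose their own partitioning rule, and the functions `shard_for`, `partition_by_term` and `lookup`, as well as the use of a CRC32 hash to pick a computer, are assumptions made for the example.

```python
import zlib


def shard_for(term, num_machines):
    """Pick the computer responsible for a term (stable across runs)."""
    return zlib.crc32(term.encode("utf-8")) % num_machines


def partition_by_term(index, num_machines):
    """Split one index into one smaller index per computer."""
    shards = [{} for _ in range(num_machines)]
    for term, postings in index.items():
        shards[shard_for(term, num_machines)][term] = postings
    return shards


def lookup(shards, term):
    """Only the computer responsible for the term needs to be asked."""
    return shards[shard_for(term, len(shards))].get(term, [])


index = {"brutus": [1, 2], "caesar": [1, 2, 3], "calpurnia": [3]}
shards = partition_by_term(index, num_machines=3)
print(lookup(shards, "caesar"))  # [1, 2, 3]
```

With this organization, the whole posting list of a term lives on one computer, so a one-term query contacts a single computer.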

29 Distributed indexing 分布式索引 In practice, distributed indexing is often done in the cloud ( 云计算 ) using technologies such as MapReduce. What is the cloud? Many computers with standard parts (processor, memory, disk) that work together, up to a thousand computers, and that can survive the failure of some computers (multiple copies of the data are kept on multiple computers).

30 We will not talk about the details 30

31 Dynamic indexing ( 动态索引 ) Until now, we have assumed that a document collection is static (it never changes, or rarely changes). But most collections are not static: new terms are added to the dictionary, and documents are added or removed (posting lists need to be updated).

32 How to update a dictionary? Simple approach: rebuild the dictionary periodically from scratch (e.g. every day). This is acceptable if: the number of changes over time is small; the delay in making new documents searchable is acceptable; and enough computer resources are available to construct a new index while the old one is still being used.

33 Dynamic indexing with two indexes If new documents need to be indexed quickly: a main index is created to store documents and their posting lists, and an auxiliary index is kept in memory to store new documents and their posting lists.

34 Dynamic indexing with two indexes When searching for documents, the search is done on both indexes and the results are merged. Then, the result is shown to the user. Deletions: a list is used to keep track of documents that have been deleted. Updates: updated documents are removed from the indexes and inserted again. 34

35 Dynamic indexing with two indexes When the auxiliary index becomes too large, it is merged with the main index. This can be done periodically. 35
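The two-index scheme can be sketched in Python as follows. This is a simplified sketch: the class `DynamicIndex`, its method names and the size limit `max_aux_postings` are assumptions made for the example, both indexes are plain in-memory dictionaries, and updates (remove, then insert again) are left out.

```python
class DynamicIndex:
    def __init__(self, main_index, max_aux_postings=1000):
        self.main = {t: list(p) for t, p in main_index.items()}  # main index
        self.aux = {}         # auxiliary index for new documents
        self.deleted = set()  # list of deleted documents
        self.aux_size = 0
        self.max_aux_postings = max_aux_postings

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
            self.aux_size += 1
        if self.aux_size > self.max_aux_postings:
            self.merge()  # the auxiliary index became too large

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge the results, hide deleted documents.
        found = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(found - self.deleted)

    def merge(self):
        for term in set(self.main) | set(self.aux):
            docs = set(self.main.get(term, [])) | set(self.aux.get(term, []))
            docs -= self.deleted
            if docs:
                self.main[term] = sorted(docs)
            else:
                self.main.pop(term, None)
        self.aux, self.aux_size = {}, 0
        self.deleted.clear()


idx = DynamicIndex({"caesar": [1, 2], "brutus": [1]}, max_aux_postings=2)
idx.add_document(3, "caesar calpurnia")
print(idx.search("caesar"))  # [1, 2, 3]
idx.delete_document(2)
print(idx.search("caesar"))  # [1, 3]
```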

36 How are indexes stored? To store a dictionary, a file can be created for each term, containing its posting list (e.g. one file each for Shenzhen, Beijing, Brutus, Automobile). However, many computers do not handle a large number of files well. A better approach: the dictionary is stored in a single file or a database ( 数据库 ). Other solutions may also be used.

37 Performance Constructing a distributed index is more complicated than constructing an index that is stored on a single computer. But index construction and update can be very fast using a cloud (many computers). In practice, many search engines prefer to reconstruct the index from scratch rather than trying to update it.

38 A main index is used for searching: the user ( 用户 ) searches for documents with the main index while the indexer builds an updated index.

39 Construction of positional indexes We previously discussed positional indexes. Positional index ( 位置索引 ): a dictionary where the positions of terms in documents are stored. (Figure: a dictionary with the terms City, Shenzhen, Located, China; two of the posting lists shown are Book1 (3, 25, 38), Book20 (4, 100, 1000) and Book1 (2, 24, 35), Book20 (3, 500).) The latter indicates that «Shenzhen» appears as the 2nd, 24th and 35th word in Book1.

40 Construction of positional indexes Positional indexes are constructed in the same way as regular indexes. The main difference is that the positions of terms in documents are kept and stored in the index. (Figure: a dictionary with the terms City, Shenzhen, Located, China; two of the posting lists shown are Book1 (3, 25, 38), Book20 (4, 100, 1000) and Book1 (2, 24, 35), Book20 (3, 500).)
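Building a positional index can be sketched in Python as follows. It is the same idea as a regular index, except that each posting also keeps the positions of the term. The function name `build_positional_index` and the short sample text are assumptions made for the example; positions are counted from 1, as on the slide.

```python
def build_positional_index(docs):
    """docs maps a document ID to its text.
    Returns term -> {document ID -> list of word positions}."""
    index = {}
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return index


# Hypothetical document used only for illustration.
docs = {"Book1": "the city shenzhen is located in china near shenzhen bay"}
positional_index = build_positional_index(docs)
print(positional_index["shenzhen"])  # {'Book1': [3, 9]}
```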

41 Indexes for ranking Some IR systems rank documents from the most relevant to the least relevant.

42 Indexes for ranking The most relevant results should be shown to the user first. One approach is to sort the index by weight or impact (the highest-weighted documents occur first in the index). This makes it possible to stop a search for documents early (since less important or unpopular documents are listed last).
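For a query with a single term, stopping early on a weight-sorted posting list can be sketched as follows. The function `top_k`, the document names and the weights are hypothetical; how weights are computed is not covered here.

```python
def top_k(impact_ordered_postings, k):
    """Postings are (document, weight) pairs sorted by decreasing weight.
    Reading stops as soon as k documents have been collected."""
    results = []
    for doc_id, weight in impact_ordered_postings:
        if len(results) == k:
            break  # the remaining documents all have lower weights
        results.append(doc_id)
    return results


# Hypothetical weights, sorted once when the index is built.
postings = sorted([("Doc3", 0.2), ("Doc1", 0.9), ("Doc2", 0.5)],
                  key=lambda pair: pair[1], reverse=True)
print(top_k(postings, 2))  # ['Doc1', 'Doc2']
```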

43 Security for IR systems Another important consideration for an IR system is security. For example: employees can search documents in the enterprise database, but some employees should not be able to access top-secret documents. Moreover, even the existence of a document can be sensitive ( 敏感的文件 ). Hence, the IR system should not show documents that a user cannot open.

44 How to ensure security? A solution: use an access control list ( 存取控制表 ). An access control list is a file that indicates the documents that each user can access. It can be viewed as a table (matrix) where rows are users and columns are documents. (Table: one row per user, one column per document Doc1 to Doc4; a cell contains 0 if the user can't read the document and 1 if the user can read it.)

45 How to ensure security? When a user searches for documents (e.g. User1):
1. A set of documents matching the user's query is found using an inverted index (dictionary), e.g. {Doc1, Doc2, Doc3}.
2. Then, the intersection of these documents and the documents that the user can access is calculated.
3. The result is shown to the user: {Doc1}.
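The intersection step can be sketched in Python as follows. The content of `access_control_list` is hypothetical (it is only chosen so that User1 ends up with {Doc1}, as in the example), and the function name `secure_search` is an assumption made for the example.

```python
# Hypothetical access control list: user -> documents the user can read.
access_control_list = {
    "User1": {"Doc1", "Doc4"},
    "User2": {"Doc1", "Doc2", "Doc3"},
}


def secure_search(user, matching_docs, acl):
    """Keep only the matching documents that this user is allowed to read."""
    return matching_docs & acl.get(user, set())


matching = {"Doc1", "Doc2", "Doc3"}  # found with the inverted index
print(secure_search("User1", matching, access_control_list))  # {'Doc1'}
```

A user who is not in the list gets an empty result, so the existence of documents is not revealed.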

47 CHAPTER 5: INDEX COMPRESSION pdf p122 47

48 Introduction An index or dictionary can be very large if there are many documents. Compression ( 压缩 ): the process of reducing the size of an index. There are several compression techniques. They may reduce the storage space required by up to 75%.

49 Benefits of compression 1) We can save some disk space. 2) More data can fit in memory. Thus, we can increase the use of caching ( 缓存 ) (keeping the most frequently accessed information in RAM memory, for faster access, and reducing the number of disk accesses). 3) Transferring data from disk to memory becomes faster because less data is transmitted (the data is compressed). 49

50 Time needed for compression Using compression requires compressing ( 压缩数据 ) and decompressing ( 解压缩数据 ) data. This is not a difficult task. It can be done very quickly by a computer. Thus, the cost of compression and decompression is small compared to the benefits obtained by compression.

51 Statistical properties of terms in IR Besides compression, applying preprocessing to a set of documents also reduces the size of the dictionary. An example: in the Reuters-RCV1 collection, there are 485,494 terms.

52 Eliminating the 150 most common words (stop words) from indexing cuts 25% to 30% of the nonpositional postings.

54 English vs other languages English: the Oxford English Dictionary defines about 600,000 words. But this excludes names, numbers, scientific terms, etc. The reduction achieved by compression is greater for some languages, e.g. French. The reason is that French is a morphologically richer language ( 形态丰富的语言 ) than English.

55 Two types of compression Lossless compression ( 无损压缩 ): we reduce the space occupied by the data, but we do not lose any information. We will talk about this! Lossy compression ( 有损压缩 ): we reduce the space, but some data is lost. It can save more space.

56 Heaps' law There is a law for estimating the number of terms in a collection of documents: $\text{NumberOfTerms} = k \times \text{NumberOfTokens}^{b}$. In general, $30 \le k \le 100$ and $b \approx 0.5$. NumberOfTokens is the sum of the number of tokens in all documents.

57 Example: for 1 million tokens, we can expect approximately 38,000 different terms. In Reuters-RCV1, the first 1 million tokens actually contain 38,365 different terms. The parameter k depends a lot on the nature of the documents and how they are processed. Case folding and stemming reduce the growth rate of the vocabulary. Spelling errors and numbers increase the vocabulary growth.
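The estimate in this example can be reproduced with a few lines of Python. The values k = 44 and b = 0.49 are the ones the textbook reports for Reuters-RCV1; they are not stated on the slide, so treat them as an assumption of this sketch, as is the function name `heaps_law`.

```python
def heaps_law(number_of_tokens, k, b):
    """Estimated number of different terms: k * NumberOfTokens ** b."""
    return k * number_of_tokens ** b


# k and b as reported in the textbook for Reuters-RCV1 (assumption).
estimate = heaps_law(1_000_020, k=44, b=0.49)
print(round(estimate))  # roughly 38,000 terms, close to the observed 38,365
```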

58 The relationship between collection size and vocabulary size is often linear in log-log space. (Figure: vocabulary size plotted against collection size.)

59 Frequency of terms In real life, a few terms are accessed very often, while many terms are rarely accessed. We can take advantage of this for dictionary compression.

60 How to store the dictionary? Fixed-length encoding: each term is stored using the same amount of memory (e.g. 20 bytes for each term). Problem 1: if we use a fixed amount of memory for each term, some memory is wasted, because not all terms have the same number of characters!

61 How to store the dictionary? Fixed-length encoding: each term is stored using the same amount of memory (e.g. 20 bytes for each term). Problem 2: if the chosen size for storing a term is too small, some long terms cannot be stored in the dictionary. In this example, terms with more than 20 characters cannot be stored.

62 Variable-length encoding: each term is stored using a variable amount of memory. This can save a lot of memory!
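The difference between fixed-length and variable-length storage can be sketched in Python as follows. The terms are stored one after another in a single string, and a list of pointers (offsets) records where each term starts. The function names, the sample terms and the 20-character width are assumptions made for the example; note that the pointers also take some memory.

```python
def build_dictionary_string(terms):
    """Store all terms in one string, plus one pointer (offset) per term."""
    offsets, position = [], 0
    for term in terms:
        offsets.append(position)
        position += len(term)
    return "".join(terms), offsets


def term_at(dictionary_string, offsets, i):
    """A term ends where the next term starts."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(dictionary_string)
    return dictionary_string[offsets[i]:end]


terms = sorted(["automata", "automate", "automatic", "automation", "a", "zoo"])
dictionary_string, offsets = build_dictionary_string(terms)

fixed_length_size = 20 * len(terms)     # 20 characters reserved per term
variable_size = len(dictionary_string)  # only the characters really used
print(fixed_length_size, variable_size)  # 120 39
print(term_at(dictionary_string, offsets, 2))  # automate
```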

63 Block encoding: terms are grouped into blocks, and each term is preceded by a number indicating the number of letters in the term. This reduces the number of pointers, because only one pointer per block is needed. This can save a lot of memory!
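Block encoding can be sketched in Python as follows. In this sketch, the length of each term is stored as a single character placed before the term (an assumption standing in for one byte), and a pointer is kept only for the first term of each block. The function names and the block size of 4 are also assumptions made for the example.

```python
def build_blocked_string(terms, block_size):
    """Store each term as <length><term>; keep one pointer per block."""
    parts, block_pointers, position = [], [], 0
    for i, term in enumerate(terms):
        if i % block_size == 0:
            block_pointers.append(position)  # pointer to the block start
        entry = chr(len(term)) + term        # length stored as one character
        parts.append(entry)
        position += len(entry)
    return "".join(parts), block_pointers


def read_block(blocked_string, start, block_size):
    """Read the terms of one block by following the stored lengths."""
    terms, position = [], start
    while len(terms) < block_size and position < len(blocked_string):
        length = ord(blocked_string[position])
        terms.append(blocked_string[position + 1:position + 1 + length])
        position += 1 + length
    return terms


terms = ["automata", "automate", "automatic", "automation", "zoo"]
blocked_string, block_pointers = build_blocked_string(terms, block_size=4)
print(len(block_pointers))  # 2 pointers instead of 5
print(read_block(blocked_string, block_pointers[1], 4))  # ['zoo']
```

The price is a slightly slower lookup: to find a term, the right block is located first, and then the terms inside the block are read one after another.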

64 Front-coding If a dictionary is sorted, several consecutive words share the same prefix ( 前缀 ). This information can be used to further compress the dictionary. In this example, we don't need to store "automat" several times. This saves memory!

65 An illustration of the compression Explanation on next slide 65

66 Explanation of the previous slide We have several words: automata, automate, automatic, automation. We want to compress this data to make it smaller. Since all these words start with "automat", we write:
8automat <-- Here 8 is the number of letters in the first word, "automata" (the shared prefix "automat" has 7 letters)
Then, we complete automata as follows: *a <-- The * marks the end of the shared prefix "automat", and we must add the character "a" to get "automata"
Then, we write automate as follows: 1◊e <-- This means that it is the same as "automat" but we must add 1 character, which is "e", to get "automate"
Then, we write automatic as follows: 2◊ic <-- This means that it is the same as "automat" but we must add 2 characters, which are "ic", to get "automatic"
Then, we write automation as follows: 3◊ion <-- This means that it is the same as "automat" but we must add 3 characters, which are "ion", to get "automation"
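The encoding explained above can be sketched in Python as follows. This is a sketch for one block of sorted words only; the function names `front_encode` and `front_decode` are assumptions made for the example, and the sketch assumes that words contain neither the `*` symbol nor a digit at the start of the shared prefix.

```python
import os


def front_encode(terms):
    """Encode a block of sorted terms that share a common prefix."""
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    # Length of the first term, the shared prefix, '*', rest of the term.
    encoded = f"{len(first)}{prefix}*{first[len(prefix):]}"
    # Each other term: number of extra characters, '◊', extra characters.
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}◊{suffix}"
    return encoded


def front_decode(encoded):
    """Rebuild the list of terms from the encoded block."""
    i = 0
    while encoded[i].isdigit():
        i += 1
    first_length = int(encoded[:i])
    star = encoded.index("*", i)
    prefix = encoded[i:star]
    rest = first_length - len(prefix)
    position = star + 1
    terms = [prefix + encoded[position:position + rest]]
    position += rest
    while position < len(encoded):
        j = position
        while encoded[j].isdigit():
            j += 1
        extra = int(encoded[position:j])
        j += 1  # skip the '◊' separator
        terms.append(prefix + encoded[j:j + extra])
        position = j + extra
    return terms


words = ["automata", "automate", "automatic", "automation"]
block = front_encode(words)
print(block)                # 8automat*a1◊e2◊ic3◊ion
print(front_decode(block))  # the four original words
```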

67 How much reduction? 67

68 Compression of posting lists It is also possible to compress posting lists. Normally, in a dictionary, for each term, we store the full list of documents where it appears. Each document is represented by a number (identifier), which uses a fixed amount of memory. To save memory, we can use a variable amount of memory to store the identifier of documents. Many approaches. See book p
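One possible way to use a variable amount of memory for document identifiers is sketched below. It combines two ideas described in the textbook: storing the gaps between consecutive document identifiers instead of the identifiers themselves, and writing each gap with variable byte codes (7 bits of data per byte, with the highest bit set on the last byte of each number). This is only one of the many approaches, and the function names are assumptions made for the example.

```python
def vb_encode_number(n):
    """Encode one number using as few bytes as possible (7 bits per byte)."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the last byte of this number
    return bytes(out)


def vb_decode(data):
    """Decode a sequence of variable byte codes back into numbers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers


def encode_postings(doc_ids):
    """Store the gaps between sorted document identifiers."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    doc_ids, total = [], 0
    for gap in vb_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
compressed = encode_postings(postings)
print(len(compressed))              # 6 bytes instead of 3 x 4 = 12 bytes
print(decode_postings(compressed))  # [824, 829, 215406]
```

Small gaps, which are common for frequent terms, take a single byte, which is where most of the saving comes from.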

69 Compression vs dictionary size: 3600 MB for the collection of documents, 107 MB for storing the index.

70 Conclusion Today, we quickly discussed chapters 4 and 5. We will continue next week. The PPT slides are on the website.

71 References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press,
