信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
1 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities Spring
2 Last week We discussed: hashing ( 散列 ) and search trees ( 搜索树 ), wildcard queries, and spelling correction. QQ Group: Website: PPTs 2
3 Course schedule ( 日程安排 ) Lectures 1 to 7. Topics: Introduction; Boolean retrieval ( 布尔检索模型 ); Term vocabulary and posting lists; Dictionaries and tolerant retrieval; Index construction and compression; Scoring, weighting, and the vector space model; Computing scores, and a complete search system; Evaluation in information retrieval; Web search engines, advanced topics, and conclusion 3
4 PHONETIC ( 语音的 ) CORRECTION Write Right Rite Wright 4
5 Phonetic correction Misspellings are often caused by a user typing a query that sounds like the target term. Phonetic hashing: try to group together all terms that sound similar. 5
6 Soundex algorithms 1. Turn every term to be indexed into a 4-character reduced form, e.g. Hermann → H655. 2. Use these 4-character codes to create an inverted index (dictionary 词典 ) from each code to the original terms. This dictionary is called the soundex index. 3. Do the same with query terms. 4. When a new query arrives, search using the soundex index. 6
7 How to calculate the 4-character codes? 1. Retain the first letter of the term. 2. Change all occurrences of the following letters to 0 (zero): A, E, I, O, U, H, W, Y. 3. Change the other letters to digits as follows: B, F, P, V to 1. C, G, J, K, Q, S, X, Z to 2. D, T to 3. L to 4. M, N to 5. R to 6. 4. Repeatedly remove one out of each pair of consecutive identical digits. 5. Remove all zeros from the resulting text. Pad the resulting text with trailing zeros and return the first four positions, which will consist of a letter followed by three digits. 7
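The coding steps on this slide can be sketched in Python as follows. This is a minimal illustration that follows the slide's steps literally (the function name `soundex` and the handling of non-letter characters are assumptions of this sketch); other published Soundex variants differ in small details.

```python
def soundex(term: str) -> str:
    """Return the 4-character Soundex code of a term, following the slide's steps."""
    # Steps 2 and 3: letter-to-digit table.
    codes = {}
    groups = (("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
              ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"))
    for letters, digit in groups:
        for letter in letters:
            codes[letter] = digit

    # Assumption of this sketch: characters other than A-Z are ignored.
    letters_only = "".join(ch for ch in term.upper() if ch in codes)
    if not letters_only:
        return ""

    # Step 1: retain the first letter; encode the remaining letters.
    first = letters_only[0]
    digits = [codes[ch] for ch in letters_only[1:]]

    # Step 4: keep only one digit out of each run of identical digits.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)

    # Step 5: remove zeros, pad with trailing zeros, keep four positions.
    body = "".join(d for d in collapsed if d != "0")
    return (first + body + "000")[:4]


print(soundex("Hermann"))  # H655
print(soundex("Herman"))   # H655
```

Both spellings map to the same code, so a query for one of them would also find the other in the soundex index.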
8 Observations about Soundex Vowels ( 元音 ) and vowel-like letters (A, E, I, O, U, H, W, Y) are viewed as interchangeable when transcribing names. Consonants ( 辅音 ) with similar sounds are considered to be the same, e.g. D and T. These rules work reasonably well for most European languages. 8
9 CHAPTER 4 INDEX CONSTRUCTION PDF p.104 9
10 Introduction We will talk about how to construct an inverted index. This process is called index construction or indexing ( 索引 ). It is performed by software called an indexer ( 索引器 ). [Figure: a collection of documents → indexer → an inverted index to search for documents] 10
11 Introduction For a Web search engine like Baidu or Bing, the indexer works together with a spider or web crawler ( 网络爬虫 ). A web crawler is software that browses the Web periodically to collect webpages so that the index of webpages can be updated. 11
12 Types of IR systems There are: small scale Information Retrieval systems (e.g. to search documents in a company) large scale Information Retrieval systems (e.g. to search the Web). In general, we want an IR system to be fast. Thus, characteristics of the computer hardware ( 计算机硬件 ) must be considered. 12
13 Computer memory There are two main types of memory in a computer: Hard drive ( 硬盘驱动器 ): permanent storage, cheaper, slower. RAM memory (RAM 芯片 ): temporary storage, more expensive, faster. 13
14 About hardware 1) Data access time ( 访问时间 ) Accessing the data in RAM is faster than accessing the data in a hard drive. To increase the speed of an IR system we should keep as much data as possible in RAM. We may use a computer having several gigabytes (GB) of RAM for an IR system. A technique called caching ( 缓存 ) consists of keeping the most frequently accessed data in RAM memory. 14
15 About hardware 2) How the data is organized is important. How the data is organized in memory also influences how fast the data can be read or written. In general, if the data that we read is stored contiguously ( 连续的 ) on the hard drive, then reading the data will be faster than if the data is not stored contiguously. [Figure: data stored contiguously vs. data not stored contiguously] 15
16 About hardware 3) Data compression ( 数据压缩 ) can reduce the time for reading data on the hard drive. Data compression refers to techniques for reducing the size of the data. If the data is smaller, reading it is faster. [Figure: uncompressed data vs. compressed data] 16
17 Simple approach for index construction Step 1. Each document from the collection is read. For each word, a <term, document ID> pair is created, e.g. the pair <Brutus, 1> indicates that the term Brutus appears in document #1 17
18 Simple approach for index construction Step 1 (continued). A term can produce several pairs, e.g. the pair <Brutus, 2> indicates that the term Brutus also appears in document #2 18
19 Step 2. All the pairs are sorted alphabetically by term (and then by document ID). Thus, all pairs representing the same term now appear consecutively, e.g. all the pairs for the term «was». 19
20 Step 3. The pairs with the same term are then combined to create the inverted index (dictionary) 20
21 Step 3. The pairs are then used to create the inverted index (dictionary). Each entry contains: the term (e.g. Brutus); the frequency of this term, which is optional (e.g. Brutus appears in 2 documents); and the posting list (e.g. Brutus appears in documents 1 and 2). 21
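The three steps above can be sketched in a few lines of Python. This is only an illustration for a tiny collection that fits in memory; the function name `build_index`, the whitespace tokenizer, and the two sample documents are assumptions of this sketch.

```python
def build_index(docs):
    """Build an inverted index from a {document ID: text} mapping."""
    # Step 1: create one <term, document ID> pair per word.
    pairs = []
    for doc_id, text in docs.items():
        for term in text.lower().split():
            pairs.append((term, doc_id))

    # Step 2: sort the pairs by term, then by document ID.
    pairs.sort()

    # Step 3: combine the pairs that share a term into one posting list.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip repeated pairs
            postings.append(doc_id)
    return index


docs = {1: "Brutus killed Caesar", 2: "Caesar was ambitious said Brutus"}
index = build_index(docs)
print(index["brutus"])       # posting list: [1, 2]
print(len(index["brutus"]))  # document frequency: 2
```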
22 Example Reuters-RCV1: a collection of about 800,000 news documents published between August 20, 1996 and August 19, 1997. About 1 GB of text; average: 200 tokens per document; 400,000 terms 22
23 Example (cont'd) 100 million tokens. Each <term ID, document ID> pair requires two 32-bit (4-byte) numbers, i.e. 8 bytes. Storing 100 million pairs therefore takes about 0.8 GB. This can fit in the memory of a desktop computer. However, for larger document collections, this is not possible. 23
24 Index construction If a computer does not have enough RAM memory, the index must be created on the hard drive. At any given moment, only part of the data can be stored in RAM memory. Thus, the list of <term, document ID> pairs must be stored on the hard drive. It must also be sorted on the hard drive. It is not easy to write a software ( 软件 ) program that does this. This is an advanced topic. For more details, see p.71 of the book 24
25 Several variations of indexing There are several other approaches for indexing. Another one: 1. An empty dictionary is created in RAM memory. 2. Documents are read one by one to fill the dictionary. 3. If the memory is full, the current dictionary is saved to disk and a new dictionary is created in memory. 4. The process continues by filling the new dictionary. 5. Finally, all the dictionaries need to be merged to obtain a single dictionary. 25
26 Distributed indexing 分布式索引 Up to now, we have discussed indexing on a single computer. For large document collections (e.g. the World Wide Web), indexing cannot be done efficiently using a single computer. Solution: create a distributed index ( 分布式索引 ). It is an index that is stored on many computers. 26
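The five steps above can be sketched as follows. This is a simplified illustration, not the exact algorithm from the book: "memory is full" is simulated by a hypothetical limit `max_terms` on the number of terms in the in-memory dictionary, blocks are saved as JSON files, and the final merge loads all blocks at once, which a real indexer would avoid.

```python
import json
import os
import tempfile


def write_block(block, tmp_dir, number):
    """Save one in-memory dictionary to disk and return the file path."""
    path = os.path.join(tmp_dir, f"block{number}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(block, f)
    return path


def index_in_blocks(docs, max_terms, tmp_dir):
    """Steps 1 to 4: fill a dictionary, flush it to disk when it is 'full'."""
    paths, block = [], {}
    for doc_id, text in docs:
        for term in text.lower().split():
            postings = block.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        if len(block) >= max_terms:      # simulated "memory is full"
            paths.append(write_block(block, tmp_dir, len(paths)))
            block = {}                   # start a new empty dictionary
    if block:
        paths.append(write_block(block, tmp_dir, len(paths)))
    return paths


def merge_blocks(paths):
    """Step 5: merge all the saved dictionaries into a single dictionary."""
    merged = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for term, postings in json.load(f).items():
                merged.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}


docs = [(1, "brutus killed caesar"), (2, "caesar was ambitious"),
        (3, "brutus was noble")]
with tempfile.TemporaryDirectory() as tmp_dir:
    block_files = index_in_blocks(docs, max_terms=3, tmp_dir=tmp_dir)
    final_index = merge_blocks(block_files)
print(len(block_files), final_index["brutus"])
```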
27 Distributed indexing 分布式索引 Distributed index The index is distributed on various computers either according to terms or documents. Here we will discuss indexes where the data is organized according to terms rather than documents. 27
29 Distributed indexing 分布式索引 In practice, distributed indexing is often done in the cloud ( 云计算 ) using technologies such as MapReduce. What is the cloud? Many computers with standard parts (processor, memory, disk) that work together, up to a thousand computers. The cloud can survive the failure of some computers because multiple copies of the data are kept on multiple computers. 29
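To give a rough idea of how a MapReduce-style job builds an index organized by terms, here is a sketch that simulates the phases inside a single Python process. It does not use any real MapReduce framework; the function names and the rule "terms starting with a to m go to machine 0, all other terms go to machine 1" are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, document ID) pair per token of a document."""
    return [(term, doc_id) for term in text.lower().split()]


def partition_of(term):
    """Hypothetical term partitioning: a-m -> machine 0, the rest -> machine 1."""
    return 0 if term[0] <= "m" else 1


def reduce_phase(pairs):
    """Reduce: sort the pairs of one partition and build its posting lists."""
    index = {}
    for term, doc_id in sorted(pairs):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


def distributed_index(docs, num_machines=2):
    """Return one partial index per (simulated) machine."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_phase(doc_id, text):
            buckets[partition_of(term)].append((term, d))
    return [reduce_phase(buckets[m]) for m in range(num_machines)]


docs = {1: "caesar came", 2: "brutus saw caesar", 3: "nobody saw"}
for machine, part in enumerate(distributed_index(docs)):
    print(machine, part)
```

Each machine ends up with the complete posting lists of its own terms, so a query term only needs to be looked up on one machine.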
30 We will not talk about the details 30
31 Dynamic indexing ( 动态索引 ) Until now, we have assumed that a document collection is static (it never changes, or rarely changes). But most collections are not static: new terms are added to the dictionary, and documents are added or removed (posting lists need to be updated). 31
32 How to update a dictionary? Simple approach: rebuild the dictionary periodically from scratch (e.g. every day). This is acceptable if: the number of changes over time is small; the delay in making new documents searchable is acceptable; and enough computer resources are available to construct a new index while the old one is still being used. 32
33 Dynamic indexing with two indexes If new documents need to be indexed quickly: A main index is created to store the terms of the existing documents and their posting lists. An auxiliary index is kept in memory to store the terms of new documents and their posting lists. 33
34 Dynamic indexing with two indexes When searching for documents, the search is done on both indexes and the results are merged. Then, the result is shown to the user. Deletions: a list is used to keep track of documents that have been deleted. Updates: updated documents are removed from the indexes and inserted again. 34
35 Dynamic indexing with two indexes When the auxiliary index becomes too large, it is merged with the main index. This can be done periodically. 35
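The two-index scheme described on the last three slides can be sketched as a small Python class. This is an illustration only: the class name `DynamicIndex`, the threshold `max_aux_postings`, and keeping the main index in a Python dictionary (instead of on disk) are assumptions of this sketch.

```python
class DynamicIndex:
    """Main index + in-memory auxiliary index + list of deleted documents."""

    def __init__(self, main, max_aux_postings=4):
        self.main = {term: list(p) for term, p in main.items()}
        self.aux = {}          # auxiliary index for new documents
        self.deleted = set()   # IDs of documents that have been deleted
        self.max_aux_postings = max_aux_postings

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        # Merge when the auxiliary index becomes too large.
        if sum(len(p) for p in self.aux.values()) > self.max_aux_postings:
            self.merge()

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge the results, filter out deleted documents.
        found = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(found - self.deleted)

    def merge(self):
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        cleaned = {t: sorted(set(p) - self.deleted) for t, p in self.main.items()}
        self.main = {t: p for t, p in cleaned.items() if p}
        self.aux, self.deleted = {}, set()


idx = DynamicIndex({"brutus": [1, 2], "caesar": [1]}, max_aux_postings=3)
idx.add_document(3, "brutus was noble")
idx.delete_document(2)
print(idx.search("brutus"))
```

An update can be expressed as `delete_document` followed by `add_document` with a new document ID.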
36 How are indexes stored? To store a dictionary, a file can be created for each term, containing its posting list (e.g. one file each for Shenzhen, Beijing, Brutus, Automobile). However, many computers do not handle a large number of files well. A better approach: the dictionary is stored in a single file or a database ( 数据库 ). Other solutions may also be used. 36
37 Performance Constructing a distributed index is more complicated than constructing an index that is stored on a single computer. But index construction and updates can be very fast using a cloud (many computers). In practice, many search engines prefer to reconstruct the index from scratch rather than trying to update it. 37
38 A main index is used for searching User ( 用户 ) searches for documents while a new index is being constructed. Indexer builds an updated index 38
39 Construction of positional indexes We previously discussed positional indexes. Positional index ( 位置索引 ): a dictionary where the positions of terms in documents are stored. Dictionary terms: City, Shenzhen, Located, China. Posting lists shown: Book1 (3, 25, 38), Book20 (4, 100, 1000); and Book1 (2, 24, 35), Book20 (3, 500). The second list indicates that «Shenzhen» appears as the 2nd, 24th and 35th word in Book1. 39
40 Construction of positional indexes Positional indexes are constructed in the same way as regular indexes. The main difference is that the positions of terms in documents are kept and stored in the index. Dictionary terms: City, Shenzhen, Located, China. Posting lists shown: Book1 (3, 25, 38), Book20 (4, 100, 1000); and Book1 (2, 24, 35), Book20 (3, 500). 40
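As a small illustration of the difference with a regular index, the sketch below stores, for each term and each document, the list of word positions (counting from 1). The function name and the sample sentence are assumptions of this sketch.

```python
def build_positional_index(docs):
    """Map each term to {document ID: [positions of the term in the document]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


docs = {"Book1": "the city shenzhen is located in china near shenzhen"}
index = build_positional_index(docs)
print(index["shenzhen"])  # {'Book1': [3, 9]}
```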
41 Indexes for ranking Some IR systems rank documents from the most relevant to the least relevant. Most relevant Least relevant 41
42 Indexes for ranking The most relevant results should be shown first to the user. One approach is to sort each posting list by weight or impact (the highest-weighted documents occur first). This allows a search to stop early, since less important or unpopular documents are listed last. 42
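A minimal sketch of this idea: each posting is a (document, weight) pair, the list is sorted by decreasing weight, and the search stops as soon as the weight falls below a threshold. The function names, the weights, and the threshold are made-up values for illustration.

```python
def impact_order(postings):
    """Sort (document ID, weight) postings so the highest weights come first."""
    return sorted(postings, key=lambda posting: posting[1], reverse=True)


def search_until(ordered_postings, min_weight):
    """Read an impact-ordered posting list and stop at the first low weight."""
    results = []
    for doc_id, weight in ordered_postings:
        if weight < min_weight:
            break  # every remaining document has an even lower weight
        results.append(doc_id)
    return results


postings = impact_order([("Doc1", 0.2), ("Doc2", 0.9), ("Doc3", 0.5)])
print(search_until(postings, min_weight=0.4))  # ['Doc2', 'Doc3']
```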
43 Security for IR systems Another important consideration for an IR system is security. For example: employees can search documents in the enterprise database, but some employees should not be able to access top-secret documents. Moreover, even the existence of a document can be sensitive ( 敏感的文件 ). Hence, the IR system should not show documents that a user cannot open. 43
44 How to ensure security? A solution: use an access control list ( 存取控制表 ). An access control list is a file that indicates which documents each user can access. It can be viewed as a table (matrix) where rows are users and columns are documents (Doc1, Doc2, Doc3, Doc4). Each cell is 0 (the user cannot read the document) or 1 (the user can read the document). 44
45 How to ensure security? When a user searches for documents (e.g. user1): First, the set of documents that match the user's query is found using the inverted index (dictionary): {Doc1, Doc2, Doc3}. Then, the intersection of this set and the set of documents that the user can access is calculated. The result is shown to the user: {Doc1} 45
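The filtering step can be sketched with Python sets. The access control list below is hypothetical (the actual table values are not given here); it is chosen so that user1 ends up with {Doc1}, as in the slide's example.

```python
# Hypothetical access control list: user -> documents that the user can read.
acl = {
    "user1": {"Doc1", "Doc4"},
    "user2": {"Doc2", "Doc3"},
}


def secure_search(user, matching_docs, acl):
    """Keep only the matching documents that the user is allowed to read."""
    return matching_docs & acl.get(user, set())


matches = {"Doc1", "Doc2", "Doc3"}            # found using the inverted index
print(secure_search("user1", matches, acl))   # {'Doc1'}
```

A user who is not in the list gets an empty result, so the existence of the matching documents is not revealed.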
47 CHAPTER 5: INDEX COMPRESSION pdf p122 47
48 Introduction An index or dictionary can be very large if there are many documents. Compression ( 压缩 ): the process of reducing the size of an index. There are several compression techniques. They may reduce the storage space required by up to 75%. 48
49 Benefits of compression 1) We can save some disk space. 2) More data can fit in memory. Thus, we can increase the use of caching ( 缓存 ) (keeping the most frequently accessed information in RAM memory, for faster access, and reducing the number of disk accesses). 3) Transferring data from disk to memory becomes faster because less data is transmitted (the data is compressed). 49
50 Time needed for compression Using compression requires compressing ( 压缩数据 ) and decompressing ( 解压缩数据 ) data. This is not a difficult task. It can be done very quickly by a computer. Thus, the cost of compression and decompression is small compared to the benefits obtained by compression. 50
51 Statistical properties of terms in IR Besides, if we apply preprocessing on a set of documents, the size of the dictionary will be reduced. An example: Reuters-RCV1 collection There are 485,494 terms. 51
52 Eliminating the 150 most common words from indexing cuts 25% to 30% of the non positional postings. 52
54 English vs other languages English: the Oxford English Dictionary defines more than 600,000 words, but this excludes names, numbers, scientific terms, etc. The reduction in vocabulary size achieved by preprocessing (e.g. stemming) is greater for some languages, e.g. French. The reason is that French is a morphologically richer language ( 形态丰富的语言 ) than English. 54
55 Two types of compression Lossless compression ( 无损压缩 ): we reduce the space occupied by the data without losing any information. We will talk about this! Lossy compression ( 有损压缩 ): we reduce the space further, but some information is lost. 55
56 Heaps' law There is a law for estimating the number of terms in a collection of documents: $\text{NumberOfTerms} = k \times \text{NumberOfTokens}^{b}$. In general: $30 \le k \le 100$ and $b \approx 0.5$. NumberOfTokens: the sum of the number of tokens in all documents. 56
57 Example: for 1 million tokens, we can expect approximately 38,000 different terms. In the first million tokens of Reuters-RCV1, there are 38,365 terms. The parameter k depends a lot on the nature of the documents and how they are processed. Case folding and stemming reduce the growth rate of the vocabulary. Spelling errors and numbers increase it. 57
58 [Figure: vocabulary size plotted against collection size] The relationship between collection size and vocabulary size is often linear in log-log space. 58
59 Frequency of terms In real life, a few terms are accessed very often, while many terms are rarely accessed. We can take advantage of this for dictionary compression 59
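The estimate can be reproduced with a one-line function. The values k = 44 and b = 0.49 are the ones reported for Reuters-RCV1 in the referenced book (Manning et al.); they are not stated on these slides, so treat them as an assumption of this sketch.

```python
def heaps_estimate(num_tokens, k, b):
    """Heaps' law: estimated number of distinct terms for num_tokens tokens."""
    return k * num_tokens ** b


# k and b as reported for Reuters-RCV1 in the referenced book (assumption here).
estimate = heaps_estimate(1_000_020, k=44, b=0.49)
print(round(estimate))  # roughly 38,000, close to the 38,365 terms observed
```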
60 How to store the dictionary? Fixed-length encoding: each term is stored using the same amount of memory (e.g. 20 bytes for each term). Problem 1: if we use a fixed amount of memory for each term, some memory is wasted because not all terms have the same number of characters! 60
61 How to store the dictionary? Fixed-length encoding: each term is stored using the same amount of memory (e.g. 20 bytes for each term). Problem 2: if the chosen size for storing a term is too small, some long terms cannot be stored in the dictionary. In this example, terms with more than 20 characters cannot be stored. 61
62 Variable length encoding: Each term is stored using a variable amount of memory This can save a lot of memory! 62
63 Block encoding: each term is preceded by a number indicating the number of letters in the term. This makes it possible to keep one pointer per block of terms instead of one pointer per term, which reduces the number of pointers. This can save a lot of memory! 63
64 Front-coding If a dictionary is sorted, several consecutive words share the same prefix ( 前缀 ). This information can be used to further compress the dictionary. In this example, we don't need to store «automat» several times. This saves memory! 64
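The sketch below illustrates block encoding: all terms are written into one long byte string, each preceded by its length, and a pointer is kept only for the first term of every block. The function names, the block size of 4, and the use of a single byte for the length (so terms must be shorter than 256 bytes) are assumptions of this sketch.

```python
def pack(terms, block_size):
    """Store the terms in one byte string; keep one pointer per block."""
    data, pointers = bytearray(), []
    for i, term in enumerate(terms):
        if i % block_size == 0:
            pointers.append(len(data))   # pointer to the start of a new block
        encoded = term.encode("utf-8")
        data.append(len(encoded))        # 1 length byte (assumes < 256 bytes)
        data += encoded
    return bytes(data), pointers


def read_block(data, start, block_size):
    """Decode up to block_size terms starting at a block pointer."""
    terms, pos = [], start
    while len(terms) < block_size and pos < len(data):
        length = data[pos]
        terms.append(data[pos + 1:pos + 1 + length].decode("utf-8"))
        pos += 1 + length
    return terms


terms = ["automata", "automate", "automatic", "automation", "zebra"]
data, pointers = pack(terms, block_size=4)
print(pointers)                           # [0, 39]: 2 pointers for 5 terms
print(read_block(data, pointers[1], 4))   # ['zebra']
```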
65 An illustration of the compression Explanation on next slide 65
66 Explanation of the previous slide We have several words: automata, automate, automatic, automation. We want to compress this data to make it smaller. Since all these words start with "automat", we first write: 8automat*a <-- Here 8 is the number of letters in the first word, "automata"; "automat" is the shared prefix; * marks the end of the prefix; and "a" is the character that we must add to get "automata". Then, we write automate as follows: 1 e <-- This means that it is the same as "automat" but we must add 1 character, which is "e", to get "automate". Then, we write automatic as follows: 2 ic <-- This means that it is the same as "automat" but we must add 2 characters, which are "ic", to get "automatic". Then, we write automation as follows: 3 ion <-- This means that it is the same as "automat" but we must add 3 characters, which are "ion", to get "automation". 66
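The front-coding of this example can be sketched as follows. The separator `|` between a suffix length and the suffix is an assumption of this sketch (the slide shows a space), and the decoder assumes that terms contain no digits, `*` or `|`.

```python
import os


def front_code(block):
    """Front-code a sorted block of terms that share a common prefix."""
    prefix = os.path.commonprefix(block)
    first = block[0]
    # Length of the first term, the shared prefix, '*', then the rest of it.
    encoded = f"{len(first)}{prefix}*{first[len(prefix):]}"
    for term in block[1:]:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}|{suffix}"   # suffix length, separator, suffix
    return encoded


def front_decode(encoded):
    """Rebuild the list of terms from a front-coded block."""
    head, tail = encoded.split("*", 1)
    prefix = head.lstrip("0123456789")
    first_length = int(head[:len(head) - len(prefix)])
    pos = first_length - len(prefix)           # length of the first suffix
    terms = [prefix + tail[:pos]]
    while pos < len(tail):
        mark = tail.index("|", pos)
        length = int(tail[pos:mark])
        terms.append(prefix + tail[mark + 1:mark + 1 + length])
        pos = mark + 1 + length
    return terms


words = ["automata", "automate", "automatic", "automation"]
coded = front_code(words)
print(coded)                 # 8automat*a1|e2|ic3|ion
print(front_decode(coded))   # the four original words
```

The coded block needs 22 characters instead of the 35 characters of the four words.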
67 How much reduction? 67
68 Compression of posting lists It is also possible to compress posting lists. Normally, in a dictionary, for each term, we store the full list of documents where it appears. Each document is represented by a number (identifier), which uses a fixed amount of memory. To save memory, we can use a variable amount of memory to store the identifier of documents. Many approaches. See book p
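One approach described in the referenced book is to store the gaps between consecutive document identifiers instead of the identifiers themselves, and to encode each gap with a variable number of bytes (variable byte encoding). The sketch below illustrates this; the function names are assumptions, and the posting list is assumed to be sorted and non-empty.

```python
def vb_encode_number(n):
    """Encode one number: 7 data bits per byte, high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)


def vb_decode(data):
    """Decode a byte string produced by concatenating vb_encode_number results."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers


def compress_postings(doc_ids):
    """Store the first ID, then the gaps between consecutive IDs."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vb_encode_number(gap) for gap in gaps)


def decompress_postings(data):
    doc_ids, total = [], 0
    for gap in vb_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
compressed = compress_postings(postings)
print(len(compressed))                  # 6 bytes instead of 12 with 4-byte IDs
print(decompress_postings(compressed))  # [824, 829, 215406]
```

Small gaps, which are common for frequent terms, fit in a single byte.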
69 Compression vs Dictionary size 3600 MB for the collection of documents 107 MB for storing the index ( ) 69
70 Conclusion Today, we have quickly discussed chapters 4 and 5. We will continue next week. The PPT slides are on the website. 70
71 References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press,
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval CS276: Informa*on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 4: Index Construc*on Plan Last lecture: Dic*onary data structures Tolerant retrieval
More information计算机信息表达. Information Representation 刘志磊天津大学智能与计算学部
计算机信息表达 刘志磊天津大学智能与计算学部 Bits & Bytes Bytes & Letters More Bytes Bit ( 位 ) the smallest unit of storage Everything in a computer is 0 s and 1 s Bits why? Computer Hardware Chip uses electricity 0/1 states
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationIntroduction to Information Retrieval
Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural
More informationBi-monthly report. Tianyi Luo
Bi-monthly report Tianyi Luo 1 Work done in this week Write a crawler plus based on keywords (Support Chinese and English) Modify a Sina weibo crawler (340M/day) Offline learning to rank module is completed
More informationOutline of the course
Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More information: Operating System 计算机原理与设计
.. 0117401: Operating System 计算机原理与设计 Chapter 11: File system interface( 文件系统接口 ) 陈香兰 xlanchen@ustc.edu.cn http://staff.ustc.edu.cn/~xlanchen Computer Application Laboratory, CS, USTC @ Hefei Embedded
More informationEECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling
EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationThe State and Opportunities of HPC Applications in China. Ruibo Wang National University of Defense Technology
The State and Opportunities of HPC Applications in China Ruibo Wang National University of Defense Technology Outline Brief introduction to the Sites Applications Fusion Development of HPC, Cloud & Big
More information第二小题 : 逻辑隔离 (10 分 ) OpenFlow Switch1 (PC-A/Netfpga) OpenFlow Switch2 (PC-B/Netfpga) ServerB PC-2. Switching Hub
第二小题 : 逻辑隔离 (10 分 ) 一 实验背景云平台服务器上的不同虚拟服务器, 分属于不同的用户 用户远程登录自己的虚拟服务器之后, 安全上不允许直接访问同一局域网的其他虚拟服务器 二 实验目的搭建简单网络, 通过逻辑隔离的方法, 实现用户能远程登录局域网内自己的虚拟内服务器, 同时不允许直接访问同一局域网的其他虚拟服务器 三 实验环境搭建如图 1-1 所示, 我们会创建一个基于 OpenFlow
More informationInformation Retrieval
Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 1: Boolean retrieval Information Retrieval Information Retrieval (IR)
More informationInformation Retrieval
Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg
More informationTechnology: Anti-social Networking 科技 : 反社交网络
Technology: Anti-social Networking 科技 : 反社交网络 1 Technology: Anti-social Networking 科技 : 反社交网络 The Growth of Online Communities 社交网络使用的增长 Read the text below and do the activity that follows. 阅读下面的短文, 然后完成练习
More informationIntroduc)on to. CS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Ch. 4 Index construc)on How do we construct an index? What strategies can we use with limited main memory? Sec. 4.1 Hardware basics Many design decisions in
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationIndex Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson
Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Index Construction Overview Introduction
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationGUJARAT TECHNOLOGICAL UNIVERSITY
GUJARAT TECHNOLOGICAL UNIVERSITY INFORMATION TECHNOLOGY DATA COMPRESSION AND DATA RETRIVAL SUBJECT CODE: 2161603 B.E. 6 th SEMESTER Type of course: Core Prerequisite: None Rationale: Data compression refers
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationRecap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval
Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization
More information1 o Semestre 2007/2008
Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 6 7 Outline 1 2 3 4 5 6 7 Text es An index is a mechanism to locate a given term in
More information3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-1. Dictionaries and Tolerant Retrieval Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Dictionary data structures for inverted indexes Sec. 3.1 The dictionary
More informationPart 2: Boolean Retrieval Francesco Ricci
Part 2: Boolean Retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content p Term document matrix p Information
More informationInformation Retrieval. Lecture 10 - Web crawling
Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationColor LaserJet Pro MFP M477 入门指南
Color LaserJet Pro MFP M477 入门指南 Getting Started Guide 2 www.hp.com/support/colorljm477mfp www.register.hp.com ZHCN 4. 在控制面板上进行初始设置...2 5. 选择一种连接方式并准备安装软件...2 6. 找到或下载软件安装文件...3 7. 安装软件...3 8. 移动和无线打印
More informationMachine Vision Market Analysis of 2015 Isabel Yang
Machine Vision Market Analysis of 2015 Isabel Yang CHINA Machine Vision Union Content 1 1.Machine Vision Market Analysis of 2015 Revenue of Machine Vision Industry in China 4,000 3,500 2012-2015 (Unit:
More informationEncoding. A thesis submitted to the Graduate School of University of Cincinnati in
Lossless Data Compression for Security Purposes Using Huffman Encoding A thesis submitted to the Graduate School of University of Cincinnati in a partial fulfillment of requirements for the degree of Master
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationHBase 在 hulu 的使用和实践. hulu
HBase 在 hulu 的使用和实践 张虔熙 @ hulu qianxi.zhang@hulu.com About hulu About me 张虔熙 ü 软件工程师 @Hulu 大数据平台组 ü 专注于分布式计算和存储技术 ü 热衷于参与开源社区贡献代码 üqianxi.zhang@hulu.com Agenda Overview Audience Platform( 用户画像系统 ) Auto
More informationOverview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components
Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 01 Boolean Retrieval Example IR Problem Let s look at a simple IR problem Suppose you own a copy of Shakespeare
More informationCSE 562 Database Systems
Goal of Indexing CSE 562 Database Systems Indexing Some slides are based or modified from originals by Database Systems: The Complete Book, Pearson Prentice Hall 2 nd Edition 08 Garcia-Molina, Ullman,
More informationOTAD Application Note
OTAD Application Note Document Title: OTAD Application Note Version: 1.0 Date: 2011-08-30 Status: Document Control ID: Release _OTAD_Application_Note_CN_V1.0 Copyright Shanghai SIMCom Wireless Solutions
More informationCommand Dictionary CUSTOM
命令模式 CUSTOM [(filename)] [parameters] Executes a "custom-designed" command which has been provided by special programming using the GHS Programming Interface. 通过 GHS 程序接口, 执行一个 用户设计 的命令, 该命令由其他特殊程序提供 参数说明
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationpublic static InetAddress getbyname(string host) public static InetAddress getlocalhost() public static InetAddress[] getallbyname(string host)
网络编程 杨亮 网络模型 访问 网络 Socket InetAddress 类 public static InetAddress getbyname(string host) public static InetAddress getlocalhost() public static InetAddress[] getallbyname(string host) public class OreillyByName
More informationBoolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok
Boolean Retrieval Manning, Raghavan and Schütze, Chapter 1 Daniël de Kok Boolean query model Pose a query as a boolean query: Terms Operations: AND, OR, NOT Example: Brutus AND Caesar AND NOT Calpuria
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,
More information2.8 Megapixel industrial camera for extreme environments
Prosilica GT 1920 Versatile temperature range for extreme environments PTP PoE P-Iris and DC-Iris lens control 2.8 Megapixel industrial camera for extreme environments Prosilica GT1920 is a 2.8 Megapixel
More information