Information Retrieval 6. Index compression
1 Ghislain Fourny Information Retrieval 6. Index compression Picture copyright: donest /123RF Stock Photo
2 What we have seen so far
3 Boolean retrieval. Query: lawyer AND Penang AND NOT silver. Input: set of documents. Output: subset of documents.
4 Standard inverted index (figure: dictionary terms ETH, Zürich, computer, data, CPU, information, retrieval, each with its postings list)
5 Search structures: hash tables, trees (B, B+)
6 Additional features: wildcards (comput*), spell correction (cmputer), phrase search ("Pfäffikon SZ")
7 Bi-word indices (phrase search feature). Text: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future." Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to
8 Positional index (phrase search feature). Query: "ETH Zurich". Help C,1: 1; ETH C,1: 2; Zurich C,1: 3; to C,3: 4, 7, 11; flexibly C,1: 5; react C,1: 6
9 Trigram index (wildcard, spell correction). Trigrams $co, $te, an$, com, er$, err, mpu, omp, put, ran, rra, ter, ute, each pointing to the terms that contain it (computer, terran)
10 TermIDs (figure: the terms t1 to t7 and their postings)
11 Blocked Sort-Based Indexing
12 Single-Pass In-Memory Indexing
13 Auxiliary index (figure: a main index and an auxiliary index over the terms ETH, Computer, Information, Course)
14 Logarithmic merging (figure: indexes of n, 2n and 4n postings; Z0, Z1, I0, I2)
15 Term statistics
17 Number of terms
18 Notations used in the book. N: number of documents. T: number of tokens (non-positional postings). M: number of terms (or types if stemming/lemmatization).
23 Number of terms (plot: # Terms against # Tokens; the shape of the curve is the open question, with a "max" level marked)
24 Number of terms (plot: # Terms against # Tokens). We love it when it's linear.
25 Log-log scale (plot: log # Terms (M) against log # Tokens (T)). We love it when it's linear.
26 Log-log scale: $\log M = b \log T + a$
27 "Exponential" growth: $M = e^{a} T^{b}$
28 Heaps' law: $M = k T^{b}$
29 In practice: $M = k T^{b}$ with $b \approx \frac{1}{2}$
31 In practice: $M = k \sqrt{T}$ with $30 \le k \le 100$
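As a rough illustration of Heaps' law, the sketch below predicts the vocabulary size from the number of tokens and recovers k and b from measurements by fitting the straight line log M = b log T + a. The function names and the sample values (generated with k = 50, inside the 30 to 100 range quoted above) are assumptions made for this example, not data from the lecture.

```python
import math


def heaps_vocabulary_size(num_tokens, k, b=0.5):
    """Predict the number of distinct terms M = k * T**b."""
    return k * num_tokens ** b


def fit_heaps(samples):
    """Least-squares fit of log M = b log T + a; returns (k, b) with k = e**a.

    samples: list of (num_tokens, num_terms) measurements.
    """
    xs = [math.log(t) for t, _ in samples]
    ys = [math.log(m) for _, m in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Made-up measurements generated from k = 50, b = 0.5 (illustration only).
samples = [(t, heaps_vocabulary_size(t, k=50)) for t in (10_000, 100_000, 1_000_000)]
k, b = fit_heaps(samples)
print(round(k, 3), round(b, 3))
```

On real collections the points do not lie exactly on a line, so the fitted k and b are only estimates.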
32 Distribution of terms
33 Distribution of terms (number of occurrences). the: 56,271,872; were: 3,323,884; nearer: 51,456; moderate: 19,245; champion: 9,400; stocks: 6,537; parallelogram: 503; pachyderm: 79; capacitance: 45; germanium: 12; sesquipedal: 7
34 Distribution of terms (plot: # Tokens against Rank; the top-ranked terms are the, of, and, to, in, I, was)
36 Log-log scale (plot: log # Tokens against log Rank)
38 Zipf's law (plot: log # Tokens against log Rank): $\log \text{Frequency} = a \log \text{Rank} + b$
39 Zipf's law: $\log \text{Frequency} = b - \log \text{Rank}$
40 Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$
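To make the rank/frequency relation concrete, here is a minimal sketch that ranks the terms of a token list by decreasing count and compares each count with the Zipf prediction k / Rank. The toy token list and the choice of calibrating k on the most frequent term are assumptions for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, term, count)] sorted by decreasing count; ranks start at 1."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, term, count) for rank, (term, count) in enumerate(ordered, start=1)]


def zipf_prediction(k, rank):
    """Frequency predicted by Zipf's law: k / rank."""
    return k / rank


# Toy text (made up for the example).
tokens = "the of the and the of to the and the of the".split()
table = rank_frequency(tokens)
k = table[0][2]  # calibrate k on the most frequent term
for rank, term, count in table:
    print(rank, term, count, zipf_prediction(k, rank))
```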
46 Compression techniques already covered: remove numbers; case folding (Apple → apple); remove stopwords (and, of, the); stemming (computing → compute). This reduces the size of the dictionary!
47 Impact (number of terms/types): remove numbers -2%; case folding -17%; remove stopwords -0%; stemming -17%; cumulative -33%. Source: Information Retrieval book
48 Impact (number of postings): remove numbers -8%; case folding -3%; remove stopwords -30%; stemming -4%; cumulative -42%. Source: Information Retrieval book
49 Impact (number of tokens): remove numbers -9%; case folding -0%; remove stopwords -47%; stemming -0%; cumulative -52%. Source: Information Retrieval book
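The effect of these preprocessing steps on the dictionary can be sketched on a toy sentence. The function below is a hypothetical helper written for this example; it covers number removal, case folding and stopword removal only (stemming is left out because it needs a stemmer), and the stopword list is just the three words shown on the slide.

```python
STOPWORDS = {"and", "of", "the"}  # the three stopwords shown on the slide


def vocabulary(tokens, drop_numbers=False, case_fold=False, drop_stopwords=False):
    """Return the set of distinct terms left after the selected steps."""
    terms = set()
    for token in tokens:
        if drop_numbers and token.isdigit():
            continue
        if case_fold:
            token = token.lower()
        if drop_stopwords and token.lower() in STOPWORDS:
            continue
        terms.add(token)
    return terms


# Toy sentence (made up for the example).
tokens = "The price of an Apple and the price of 2 apple pies in 2019".split()
print(len(vocabulary(tokens)))
print(len(vocabulary(tokens, drop_numbers=True)))
print(len(vocabulary(tokens, drop_numbers=True, case_fold=True)))
print(len(vocabulary(tokens, drop_numbers=True, case_fold=True, drop_stopwords=True)))
```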
50 Dictionary compression
52 Standard inverted index (figure: dictionary terms ETH, Zürich, computer, data, CPU, information, retrieval, each with its postings list). Let us start compressing the dictionary.
53 Status quo
56 Status quo: dictionary stored as a B+ tree (figure: a B+ tree over the terms almost, be, carefully, come, fair, hour, is, it, Laertes, merely, most, my, possess, should, take, that, thine, this, thy, time, to, upon, you, your; the leaves hold pointers to postings lists)
58 Standard inverted index: once the dictionary is compressed, we can make it fit in RAM.
60 Approach 1: Array. One fixed-width row per term (computer..., CPU..., data..., ETH..., information.., retrieval..., Zürich): … bytes, 4 bytes, 4 bytes
61 Approach 1: Issue. A very long term does not fit in the fixed width: zupercalifragilisticexpialidocious
65 Approach 2: String. All terms are concatenated into one string: computercpudataethinformationretrievalzürich. Per term: … bytes, 4 bytes, 3 bytes (+8 bytes)
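A minimal sketch of the dictionary-as-a-string idea: the terms are concatenated and each term keeps only the offset where it starts, so a term ends where the next one begins. The function names are hypothetical, and a real implementation would store the offsets as fixed-size pointers next to the other per-term fields rather than in a Python list.

```python
def build_string_dictionary(sorted_terms):
    """Concatenate the terms into one string and record one start offset per term."""
    offsets = []
    position = 0
    for term in sorted_terms:
        offsets.append(position)
        position += len(term)
    return "".join(sorted_terms), offsets


def term_at(string, offsets, i):
    """Recover term i: it ends where term i + 1 starts."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
    return string[offsets[i]:end]


def lookup(string, offsets, term):
    """Binary search over the offsets; return the term's index, or -1 if absent."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(string, offsets, mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


terms = sorted(["computer", "cpu", "data", "eth", "information", "retrieval", "zürich"])
string, offsets = build_string_dictionary(terms)
print(string, offsets, lookup(string, offsets, "eth"))
```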
66 Approach 3: Blocked storage. Each term is prefixed with its length: 8computer3CPU4data3ETH11information9retrieval. A term pointer is kept only every k terms: 4 bytes, 4 bytes, … bytes (+9 bytes)
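Blocked storage can be sketched as follows: every term is written as one length byte followed by the term, and an offset is kept only for the first term of each block of k terms. Lookup does a binary search on the block heads and then scans one block. The names are hypothetical, the one-byte length limits terms to 255 bytes, and decoding a whole block to read its head is a simplification chosen to keep the example short.

```python
def build_blocked_dictionary(sorted_terms, k):
    """Store each term as <length byte><term bytes>; keep an offset every k terms."""
    block_offsets = []
    data = bytearray()
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_offsets.append(len(data))
        encoded = term.encode("utf-8")
        data.append(len(encoded))  # single length byte: terms of at most 255 bytes
        data.extend(encoded)
    return bytes(data), block_offsets


def read_block(data, start, end):
    """Decode the length-prefixed terms stored in data[start:end]."""
    terms = []
    position = start
    while position < end:
        length = data[position]
        terms.append(data[position + 1:position + 1 + length].decode("utf-8"))
        position += 1 + length
    return terms


def lookup(data, block_offsets, k, term):
    """Binary search on block heads, then linear scan inside one block."""
    if not block_offsets:
        return -1

    def block_terms(b):
        end = block_offsets[b + 1] if b + 1 < len(block_offsets) else len(data)
        return read_block(data, block_offsets[b], end)

    lo, hi = 0, len(block_offsets) - 1
    while lo < hi:  # find the last block whose first term is <= term
        mid = (lo + hi + 1) // 2
        if block_terms(mid)[0] <= term:
            lo = mid
        else:
            hi = mid - 1
    for j, candidate in enumerate(block_terms(lo)):
        if candidate == term:
            return lo * k + j
    return -1


terms = ["computer", "cpu", "data", "eth", "information", "retrieval", "zürich"]
data, block_offsets = build_blocked_dictionary(terms, k=4)
print(block_offsets, lookup(data, block_offsets, 4, "data"))
```

The scan inside the block is the price paid for dropping k - 1 pointers per block.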
69 No free lunch
70 Compromise between space and time
74 Binary search steps (no blocking). ETH is found directly; CPU and retrieval cost one extra "memory seek"; computer, data, information and Zürich cost two. Average: avg(0,1,2,2,1,2,2) = 10/7 ≈ 1.4
78 Binary search steps (with blocking). ETH is found directly; computer and information cost one extra "memory seek"; CPU and retrieval cost two; data and Zürich cost three. Average: avg(0,1,2,3,1,2,3) = 12/7 ≈ 1.7
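79 Approach 4: Front coding. Before: 8automata8automate9automatic10automation. A term pointer only every k terms: 4 bytes, 4 bytes, … bytes (+9 bytes)
80 Approach 4: Front coding. After: 8automat*a8 e9 ic10 ion. The shared prefix automat is stored once, and each following term keeps only its length and its suffix. Same fields as before, but fewer bytes in the string.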
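The core of front coding can be sketched in a few lines: within a block of sorted terms, store the common prefix once and only the differing suffixes. This sketch returns a plain (prefix, suffixes) pair rather than the exact byte layout of the slide, and the function names are assumptions for illustration.

```python
import os


def front_encode_block(terms):
    """Split a block of sorted terms into (common prefix, list of suffixes)."""
    prefix = os.path.commonprefix(terms)  # character-wise common prefix
    return prefix, [term[len(prefix):] for term in terms]


def front_decode_block(prefix, suffixes):
    """Rebuild the original terms."""
    return [prefix + suffix for suffix in suffixes]


block = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_encode_block(block)
print(prefix, suffixes)
```

Here the 35 characters of the four terms shrink to 7 prefix characters plus 7 suffix characters, before counting the length fields.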
85 How did we do? Collection: 960 MB. Dictionary size: fixed width 11.2 MB; unique string and pointers 7.6 MB; blocking (k=4) 7.1 MB; blocking and front coding 5.9 MB. Source: Information Retrieval book
86 Postings file compression
89 Standard inverted index (figure: dictionary terms ETH, Zürich, computer, data, CPU, information, retrieval, each with its postings list). We compressed the dictionary. Now, we want to compress the postings lists.
90 Standard inverted index: in other words, we want to compress lists of integers.
93 Standard storage: every docID is stored on 4 bytes (4 bytes = 32 bits), that is, numbers between 0 and 4,294,967,295.
94 Encoding gaps: with 4 bytes (32 bits) per docID, can we encode with less space?
97 Encoding gaps (figure: a postings list and the differences between consecutive docIDs). These are small gaps!
101 Encoding gaps: but this only works for frequent terms!
102 Encoding gaps: can we have variable gap size?
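Gap encoding itself is simple enough to show directly: keep the first docID and replace every following docID by its difference to the previous one. The postings list below is a made-up example of a frequent term; the function names are hypothetical.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Keep the first docID, then store differences between consecutive docIDs."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Rebuild the docIDs with a running sum."""
    return list(accumulate(gaps))


postings = [283042, 283043, 283044, 283045]  # made-up docIDs
gaps = to_gaps(postings)
print(gaps)
```

The gaps only pay off once they are stored with a code that spends fewer bits on small numbers, which is what the following slides introduce.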
103 Variable byte encoding
108 Fixed-length encoding (every value on 4 bytes): we know exactly where the boundaries are. Read 32 bits, stop, read the next 32 bits, and so on.
110 Variable-length encodings (values on different numbers of bytes, e.g. 2, 4, 3, 5, 4): we do not know a priori where the boundaries are. After how many bits x should we stop?
111 Prefix codes: we can deduce from the bits when to stop.
114 Prefix codes: phone numbers (table columns: example, internally)
124 Prefix codes: UTF-8 (table columns: character, codepoint, codepoint in binary, UTF-8 variable-length encoding). P (U+0050): the codepoint fits in 7 bits, 1 byte. π (U+03C0): fits in 11 bits, 2 bytes. € (U+20AC): fits in 16 bits, 3 bytes.
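The UTF-8 examples can be checked with Python's built-in encoder. The helper name is made up for this sketch; it prints the encoded bytes in binary so the leading bits that announce the length of each sequence are visible.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of one character as space-separated binary bytes."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


for char in ("P", "π", "€"):
    print(char, f"U+{ord(char):04X}", utf8_bits(char))
```

A leading 0 announces a one-byte character, 110 a two-byte sequence and 1110 a three-byte sequence, so a decoder always knows when to stop.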
126 Variable byte encoding (here with 8-bit packets): each packet holds one continuation bit, followed by the encoding on n-1 bits (here 7).
130 Case 1: at most 7 bits required. Example: 4 (binary 100) fits in a single packet, whose continuation bit says "ends here".
134 Case 2: between 8 and 14 bits required. Example: 270 takes two packets. The first is marked "doesn't end here", the second "ends here".
137 And so on and so forth. Continuation bit: 1 = doesn't end here, 0 = ends here.
144 Variable byte encoding: example with 4-bit packets (table columns: decimal, binary, variable byte encoding). Values that fit on 3 bits take one packet; values that fit on 6 bits take two. Result: … % less space.
145 Variable byte encoding is a parameterized encoding: the packet size can be n=2 (xx), n=4 (xxxx), n=8 (xxxxxxxx) or n=16 (xxxxxxxxxxxxxxxx).
151 Example (here, 8 bits)
152 No free lunch
154 Compromise for variable byte encoding. Big packets: little compression, little overhead. Small packets: much compression, lot of bits to manipulate.
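A minimal sketch of variable byte encoding with a configurable packet size. It follows the convention stated on the slides (continuation bit 1 = doesn't end here, 0 = ends here); placing the continuation bit first in the packet and writing the most significant payload group first are assumptions of this sketch, as are the function names.

```python
def vb_encode(number, packet_bits=8):
    """Encode a non-negative integer as a list of packets (small integers).

    Each packet is one continuation bit followed by packet_bits - 1 payload bits.
    """
    payload_bits = packet_bits - 1
    mask = (1 << payload_bits) - 1
    groups = [number & mask]
    number >>= payload_bits
    while number > 0:
        groups.append(number & mask)
        number >>= payload_bits
    groups.reverse()  # most significant group first (assumption)
    flag = 1 << payload_bits  # continuation bit set: "doesn't end here"
    return [group | flag for group in groups[:-1]] + [groups[-1]]


def vb_decode(packets, packet_bits=8):
    """Decode a stream of packets back into the list of integers."""
    payload_bits = packet_bits - 1
    mask = (1 << payload_bits) - 1
    flag = 1 << payload_bits
    numbers, current = [], 0
    for packet in packets:
        current = (current << payload_bits) | (packet & mask)
        if not packet & flag:  # continuation bit 0: the number ends here
            numbers.append(current)
            current = 0
    return numbers


stream = vb_encode(4) + vb_encode(270)
print([f"{packet:08b}" for packet in stream], vb_decode(stream))
```

Changing packet_bits trades compression against the number of packets to manipulate, which is the compromise described above.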
156 Can we compress even more? Bitwise?
157 Gamma encoding
158 Peter Elias
161 Unary code: n is written as n ones, and a zero to mark the stop.
162 First integers in unary code (table columns: integer, length (unary))
166 Example (here, 8 bits)
172 Gamma encoding: example. 19 in binary is 10011. Without the leading 1, this leaves 0011, of length 4. The length in unary is 11110. The gamma code is the unary length followed by the binary without the leading 1: 111100011.
181 Gamma encoding on the first integers (table columns: decimal, binary, binary without leading 1, length, length (unary), gamma code)
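The column headers above translate almost line by line into code. This sketch works on strings of "0" and "1" characters for readability (a real index would pack bits), and the function names are chosen for the example.

```python
def unary(n):
    """Unary code: n ones, and a zero to mark the stop."""
    return "1" * n + "0"


def gamma_encode(n):
    """Gamma code of a positive integer: unary length, then binary without leading 1."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]  # strip the '0b' marker and the leading 1
    return unary(len(offset)) + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    numbers = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the zero that ends the unary part
        numbers.append(int("1" + bits[i:i + length], 2))  # restore the leading 1
        i += length
    return numbers


for n in (1, 2, 3, 4, 19):
    print(n, gamma_encode(n))
```

Because the unary part tells the decoder how many bits follow, codes can be concatenated without separators.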
184 Gamma encoding properties: variable-length encoding, prefix encoding, universal encoding.
186 Shannon entropy: $H(X) = E[I(X)]$, where $I$ is the "amount of information", measured as a number of bits
188 Shannon entropy (plot: $I(p)$ against $p$, for $p$ between 0 and 1): $H(X) = E[-\log_2 p_X(X)]$
190 Shannon entropy: $H(X) = E[I(X)] = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x)$. Extreme cases: $H(p) = 0$ (a single certain outcome) and $H(p) = \log n$ (n equally likely outcomes)
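The entropy formula and its two extreme cases can be checked numerically with a short sketch. The function name is an assumption, and outcomes with probability 0 are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: minus the sum of p * log2(p) over the outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([1.0]))          # a single certain outcome
print(entropy([0.25] * 4))     # four equally likely outcomes
print(entropy([0.5, 0.25, 0.25]))
```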
191 Expected length of gamma encoding: $E[L_\gamma(X)] \le 3 H(X) = 3 E[I(X)]$, within a constant factor of optimal!
192 Expected length of gamma encoding: $E[L_\gamma(X)] \le 2 H(X) + 1 = 2 E[I(X)] + 1$, within a constant factor of optimal!
193 How much can we compress the inverted index?
194 Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$
196 Zipf's law (renormalized): $\text{Renormalized frequency} = \frac{c}{\text{Rank}}$, with $\sum_{i=1}^{M} \frac{c}{i} = 1$
198 Zipf's law: number of occurrences per document $= \text{Document length} \times \frac{c}{\text{Rank}} = \frac{Lc}{\text{Rank}}$
200 Zipf's law: number of postings $= \text{Number of documents} \times \text{Number of occurrences per document} = \frac{NLc}{\text{Rank}}$
207 Zipf's law, approximations: split the ranked terms into blocks of $Lc$ terms, ending at $\text{Rank} = Lc$, $\text{Rank} = 2Lc$, $\text{Rank} = 3Lc$, and so on. The terms of these blocks have about $N$ postings, $\frac{N}{2}$ postings, $\frac{N}{3}$ postings, and so on.
210 Zipf's law: a term with $\frac{N}{j}$ postings has gap $= j$, so the successive blocks have gap = 1, gap = 2, gap = 3, and so on.
212 Zipf's law: block $j$ goes from $\text{Rank} = (j-1)Lc$ to $\text{Rank} = jLc$, with $\frac{N}{j}$ postings per term and gap $= j$. Bits per term: $\frac{N}{j}\,(2\log_2 j + 1)$
213 Zipf's law: bits per term block: $\frac{NLc}{j}\,(2\log_2 j + 1)$
214 Zipf's law: $\#\text{bits} = \sum_{j=1}^{M/(Lc)} \frac{NLc}{j}\,(2\log_2 j + 1)$
215 Zipf's law: $\#\text{bits} \approx \sum_{j=1}^{M/(Lc)} \frac{2NLc\,\log_2 j}{j}$
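The estimate can be evaluated numerically. The sketch below computes the normalization constant c from the condition that the renormalized frequencies sum to 1, then adds up NLc/j (2 log2 j + 1) over the M/(Lc) blocks. The function names are hypothetical, the number of blocks is rounded down, and the tiny parameter values are chosen only so the result is easy to check by hand.

```python
import math


def zipf_constant(num_terms):
    """c such that the sum of c / i for i = 1..M equals 1."""
    return 1.0 / sum(1.0 / i for i in range(1, num_terms + 1))


def estimate_gamma_bits(num_docs, doc_length, num_terms, c):
    """Estimated size in bits of the gamma-encoded postings under Zipf's law."""
    block_size = doc_length * c                # Lc terms per block
    num_blocks = int(num_terms / block_size)   # M / (Lc), rounded down
    return sum(
        num_docs * block_size / j * (2 * math.log2(j) + 1)
        for j in range(1, num_blocks + 1)
    )


# Tiny made-up collection: N = 8 documents of L = 4 tokens, M = 4 terms, c = 0.5.
print(estimate_gamma_bits(num_docs=8, doc_length=4, num_terms=4, c=0.5))
print(zipf_constant(2))
```

With these toy values there are two blocks: the first contributes 16 bits and the second 24 bits.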
220 How did we do? Collection: 960 MB. Postings file size: uncompressed on 32 bits 400 MB; uncompressed on 20 bits 250 MB; variable byte encoding (gaps) 116 MB; Elias γ encoding (gaps) 101 MB.
221 Credits. This week: Chapter 5
Recap: lecture 2 CS276A Information Retrieval
Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationAdministrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks
Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More informationCourse work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?
Course work Introduc)on to Informa(on Retrieval Problem set 1 due Thursday Programming exercise 1 will be handed out today CS276: Informa)on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationCS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Last lecture index construc)on Sort- based indexing Naïve in- memory inversion Blocked Sort- Based Indexing Merge sort is effec)ve for
More informationInformation Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007
Information Retrieval Lecture 3 - Index compression Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Dictionary and inverted index:
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationInformation Retrieval 12. Wrap-Up
Ghislain Fourny Information Retrieval 12. Wrap-Up Picture copyright: johan2011/123rf Stock Photo Lecture Overview Introduction Boolean queries Term vocabulary and posting lists Tolerant retrieval Evaluation
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationLecture 3 Index Construction and Compression. Many thanks to Prabhakar Raghavan for sharing most content from the following slides
Lecture 3 Index Construction and Compression Many thanks to Prabhakar Raghavan for sharing most content from the following slides Recap of the previous lecture Tokenization Term equivalence Skip pointers
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More information1 o Semestre 2007/2008
Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 6 7 Outline 1 2 3 4 5 6 7 Text es An index is a mechanism to locate a given term in
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationINDEX CONSTRUCTION 1
1 INDEX CONSTRUCTION PLAN Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden This time: mo among amortize Index construction on
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationV.2 Index Compression
V.2 Index Compression Heap s law (empirically observed and postulated): Size of the vocabulary (distinct terms) in a corpus E[ distinct terms in corpus] n with total number of term occurrences n, and constants,
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationIN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)
IN4325 Indexing and query processing Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for
More informationInverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5
Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationCOSC431 IR. Compression. Richard A. O'Keefe
COSC431 IR Compression Richard A. O'Keefe Shannon/Barnard Entropy = sum p(c).log 2 (p(c)), taken over characters c Measured in bits, is a limit on how many bits per character an encoding would need. Shannon
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval 1 Outline Dictionaries Wildcard queries skip Edit distance skip Spelling correction skip Soundex 2 Inverted index Our
More informationIndex Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search
Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationFRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.
Sec. 5.2 FRONT CODING Front-coding: Sorted words commonly have long common prefix store differences only (for last k-1 in a block of k) 8automata8automate9automatic10automation 8automat*a1 e2 ic3 ion Encodes
More informationCorso di Biblioteche Digitali
Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Indexing Process Indexes Indexes are data structures designed to make search faster Text search
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant
More informationwhitepaper RediSearch: A High Performance Search Engine as a Redis Module
whitepaper RediSearch: A High Performance Search Engine as a Redis Module Author: Dvir Volk, Senior Architect, Redis Labs Table of Contents RediSearch At-a-Glance 2 A Little Taste: RediSearch in Action
More information信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1 Last week We have discussed about:
More informationOutline of the course
Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationBuilding an Inverted Index
Building an Inverted Index Algorithms Memory-based Disk-based (Sort-Inversion) Sorting Merging (2-way; multi-way) 2 Memory-based Inverted Index Phase I (parse and read) For each document Identify distinct
More informationCOMPSCI 650 Applied Information Theory Feb 2, Lecture 5. Recall the example of Huffman Coding on a binary string from last class:
COMPSCI 650 Applied Information Theory Feb, 016 Lecture 5 Instructor: Arya Mazumdar Scribe: Larkin Flodin, John Lalor 1 Huffman Coding 1.1 Last Class s Example Recall the example of Huffman Coding on a
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries