Information Retrieval 6. Index compression


1 Ghislain Fourny Information Retrieval 6. Index compression Picture copyright: donest /123RF Stock Photo

2 What we have seen so far 2

3 Boolean retrieval. Query: lawyer AND Penang AND NOT silver. Input: set of documents. Output: subset of documents.

4 Standard inverted index ETH Zürich computer data CPU information retrieval

5 Search structures Hash tables Trees (B, B+) 5

6 Additional features. Wildcards: comput*. Spell correction: cmputer. Phrase search: "Pfäffikon SZ".

7 Bi-word indices (phrase search feature). Text: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future." Index: "Help ETH", "ETH Zurich", "Zurich to", "to flexibly", "flexibly react", "react to", ...

8 Positional index (phrase search feature). Query: "ETH Zurich". Postings (document, term frequency: positions): Help → C,1: 1; ETH → C,1: 2; Zurich → C,1: 3; to → C,3: 4, 7, 11; flexibly → C,1: 5; react → C,1: 6.

9 Trigram index (wildcard, spell correction). [Figure: each trigram of "computer" ($co, com, omp, mpu, put, ute, ter, er$) and of "terran" ($te, ter, err, rra, ran, an$) points to the terms that contain it.]

10 TermIDs. [Figure: terms replaced by integer termIDs t1 to t7 in a table of (termID, docID) pairs.]

11 Blocked Sort-Based Indexing 11

12 Single-Pass In-Memory Indexing 12

13 Auxiliary Index ETH Computer Information Course Auxiliary index ETH Computer Information 5 4 Main index Course

14 Logarithmic Merging n postings 2n postings 4n postings Z 0 Z 1 I 0 I 2 14

15 Term Statistics 15

16 Number of terms 16

17 Number of terms 17

18 Notations used in the book N: number of documents T: number of tokens (non-positional postings) M: number of terms (or types if stemming/lemmatization) 18

19-22 Number of terms. [Plot: # Terms against # Tokens, with a question mark for the shape of the curve.]

23 Number of terms. [Plot: # Terms against # Tokens, with a question mark and a possible maximum ("max").]

24 Number of terms. [Plot: # Terms against # Tokens.] We like it when it's linear.

25 Log-log scale. [Plot: log # Terms (M) against log # Tokens (T).] We like it when it's linear.

26 Log-log scale: $\log M = b \log T + a$. [Plot: log # Terms (M) against log # Tokens (T).]

27 "Exponential" growth: $M = e^{a} T^{b}$

28 Heaps' law: $M = k T^{b}$

29 In practice: $M = k T^{b}$ with $b \approx \frac{1}{2}$

30 In practice: $M = k \sqrt{T}$

31 In practice: $M = k \sqrt{T}$ with $30 \le k \le 100$
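
The relationship on the slides above can be tried out numerically. The following is a minimal sketch, not taken from the lecture: the function names `heaps_vocabulary_size` and `fit_heaps` are hypothetical, and the sample points are made up for illustration. It evaluates $M = k T^{b}$ and recovers $k$ and $b$ from two observations by using the straight line on the log-log scale.

```python
import math

def heaps_vocabulary_size(num_tokens, k, b=0.5):
    """Estimate the number of distinct terms M = k * T^b (Heaps' law)."""
    return k * num_tokens ** b

def fit_heaps(point_a, point_b):
    """Fit (k, b) through two (T, M) observations.

    On a log-log scale the law is the line log M = b log T + log k,
    so b is the slope between the two points.
    """
    (t1, m1), (t2, m2) = point_a, point_b
    b = (math.log(m2) - math.log(m1)) / (math.log(t2) - math.log(t1))
    k = m1 / t1 ** b
    return k, b

# Illustrative values only: k = 30 is the low end of the range on the slide.
estimate = heaps_vocabulary_size(1_000_000, k=30)
k_fit, b_fit = fit_heaps((10_000, 4_000), (1_000_000, 40_000))
```

With real data one would fit the line over many (T, M) points, for example by least squares, rather than through two of them.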

32 Distribution of terms 32

33 Distribution of terms
the: 56,271,872
were: 3,323,884
nearer: 51,456
moderate: 19,245
champion: 9,400
stocks: 6,537
parallelogram: 503
pachyderm: 79
capacitance: 45
germanium: 12
sesquipedal: 7

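34 Distribution of terms. [Plot: # Tokens against Rank; the top-ranked terms are the, of, and, to, in, I, was.]

35 Distribution of terms. [Plot: # Tokens against Rank.]

36 Log-log scale. [Plot: log # Tokens against log Rank.]

37 Zipf's law: $\log \text{Frequency} = a \log \text{Rank} + b$

38 Zipf's law: $\log \text{Frequency} = a \log \text{Rank} + b$. [Plot: log # Tokens against log Rank.]

39 Zipf's law: $\log \text{Frequency} = b - \log \text{Rank}$. [Plot: log # Tokens against log Rank.]

40 Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$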
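
To check Zipf's law on a collection, one can rank the terms by frequency and compare each count with $k / \text{Rank}$. The sketch below assumes a tokenized collection is already available as a list of strings; `rank_frequency` and `zipf_prediction` are hypothetical helper names, not part of the lecture.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) triples, most frequent term first."""
    counts = Counter(tokens)
    # Sort by decreasing frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]

def zipf_prediction(rank, k):
    """Frequency predicted by Zipf's law: k / rank."""
    return k / rank

# Toy token stream; a real check would use a large collection.
table = rank_frequency("a a a b b c".split())
k = table[0][2]  # take the frequency of the top-ranked term as k
predicted = [zipf_prediction(rank, k) for rank, _, _ in table]
```

Plotting the logarithm of the observed frequencies against the logarithm of the rank should then give a roughly straight line with slope close to -1.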

41 Compression techniques already covered 41

42 Compression techniques already covered: remove numbers.

43 Compression techniques already covered: remove numbers; case folding (Apple → apple).

44 Compression techniques already covered: remove numbers; case folding (Apple → apple); remove stopwords (and, of, the).

45 Compression techniques already covered: remove numbers; case folding (Apple → apple); remove stopwords (and, of, the); stemming (computing → compute).

46 Compression techniques already covered: remove numbers; case folding (Apple → apple); remove stopwords (and, of, the); stemming (computing → compute). This reduces the size of the dictionary!

47 Impact (number of terms/types). Remove numbers: -2%. Case folding (Apple → apple): -17%. Remove stopwords (and, of, the): -0%. Stemming (computing → compute): -17%. All combined: -33%. Source: Information Retrieval book

48 Impact (number of postings). Remove numbers: -8%. Case folding (Apple → apple): -3%. Remove stopwords (and, of, the): -30%. Stemming (computing → compute): -4%. All combined: -42%. Source: Information Retrieval book

49 Impact (number of tokens). Remove numbers: -9%. Case folding (Apple → apple): -0%. Remove stopwords (and, of, the): -47%. Stemming (computing → compute): -0%. All combined: -52%. Source: Information Retrieval book

50 Dictionary compression 50

51 Standard inverted index ETH Zürich computer data CPU information retrieval

52 Standard inverted index ETH Zürich computer data CPU information retrieval Let us start compressing the dictionary. 52

53 Status quo 53

54 Status quo: Dictionary stored as a B+ tree. [Figure: B+ tree over the terms almost, be, carefully, come, fair, hour, is, it, Laertes, merely, most, my, possess, should, take, that, thine, this, thy, time, to, upon, you, your.]

55 Status quo: Dictionary stored as a B+ tree. [Same figure.] Pointers to postings lists.

56 Status quo: Dictionary stored as a B+ tree. [Same figure.] Pointers to postings lists.

57 Standard inverted index ETH Zürich computer data CPU information retrieval Let us start compressing the dictionary. 57

58 Standard inverted index ETH Zürich computer data CPU information retrieval We can then make it fit in RAM. 58

59 Approach 1: Array computer... CPU... data... ETH... information.. retrieval... Zürich

60 Approach 1: Array computer... CPU... data... ETH... information.. retrieval... Zürich bytes 4 bytes 4 bytes 60

61 Approach 1: Issue computer... CPU... data... ETH... information.. retrieval zupercalifragilisticexpialidocious 61

62 Approach 2: String computercpudataethinformationretrievalzürich 62

63 Approach 2: String computercpudataethinformationretrievalzürich bytes 4 bytes 63

64 Approach 2: String computercpudataethinformationretrievalzürich bytes 4 bytes 3 bytes 64

65 Approach 2: String computercpudataethinformationretrievalzürich bytes 4 bytes 3 bytes (+8 bytes) 65

66 Approach 3: Blocked storage 8computer3CPU4data3ETH11information9retrieval Only every k terms k 4 bytes 4 bytes bytes (+9 bytes) 66
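
Approaches 2 and 3 can be sketched in a few lines. The code below is an illustration under assumptions, not the lecture's implementation: terms are stored as UTF-8 with a one-byte length prefix, only one offset is kept per block of k terms, and the helper names (`build_blocked_dictionary`, `read_term`, `contains`) are hypothetical. Lookup does a binary search over the block heads followed by a linear scan inside the block.

```python
def build_blocked_dictionary(terms, k=4):
    """Concatenate sorted terms as length-prefixed byte strings.

    Returns (blob, block_offsets); one offset is kept per block of k terms.
    """
    blob = bytearray()
    block_offsets = []
    for i, term in enumerate(sorted(terms)):
        encoded = term.encode("utf-8")
        if len(encoded) > 255:
            raise ValueError("term too long for a 1-byte length prefix")
        if i % k == 0:
            block_offsets.append(len(blob))  # pointer only every k terms
        blob.append(len(encoded))
        blob.extend(encoded)
    return bytes(blob), block_offsets

def read_term(blob, offset):
    """Read one length-prefixed term; return (term, offset of the next term)."""
    length = blob[offset]
    start = offset + 1
    return blob[start:start + length].decode("utf-8"), start + length

def contains(blob, block_offsets, term, k=4):
    """Binary search on block heads, then scan at most k terms in the block."""
    lo, hi = 0, len(block_offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        head, _ = read_term(blob, block_offsets[mid])
        if head <= term:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return False  # term sorts before the first block head
    offset = block_offsets[lo - 1]
    for _ in range(k):
        if offset >= len(blob):
            break
        candidate, offset = read_term(blob, offset)
        if candidate == term:
            return True
    return False

terms = ["computer", "CPU", "data", "ETH", "information", "retrieval", "Zürich"]
blob, offsets = build_blocked_dictionary(terms, k=4)
```

The linear scan inside a block is the time cost paid for storing fewer pointers. Note that Python's default string ordering sorts uppercase before lowercase, which differs from the case-insensitive order shown on the slides.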

67 No free lunch 67

68 No free lunch 68

69 No free lunch 69

70 Compromise between space and time 70

71 Binary search steps (no blocking) ETH CPU retrieval computer data information Zürich 71

72 Binary search steps (no blocking) ETH CPU One extra "memory seek" retrieval computer data information Zürich 72

73 Binary search steps (no blocking) ETH CPU retrieval computer data Two extra "memory seeks" information Zürich 73

74 Binary search steps (no blocking) ETH CPU retrieval computer data information Zürich Average: avg(0,1,2,2,1,2,2) = 10/7 ≈ 1.43

75 Binary search steps (with blocking) ETH computer information CPU retrieval data Zürich 75

76 Binary search steps (with blocking) ETH computer information CPU Two extra "memory seeks" retrieval data Zürich 76

77 Binary search steps (with blocking) ETH computer information CPU retrieval data Three extra "memory seeks" Zürich 77

78 Binary search steps (with blocking) ETH computer information CPU retrieval data Zürich Average: avg(0,1,2,3,1,2,3) = 12/7 ≈ 1.71

79 Approach 4: Front coding 8automata8automate9automatic10automation Only every k terms k 4 bytes 4 bytes bytes (+9 bytes) 79

80 Approach 4: Front coding 8automat*a8 e9 ic10 ion Only every k terms k 4 bytes 4 bytes bytes (fewer bytes)
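
The idea behind front coding is to store the shared prefix of a block once and only the differing suffixes afterwards. The sketch below shows that idea on the block from the slide; it does not reproduce the exact byte layout of the slide, and the function names are hypothetical.

```python
import os

def front_code_block(terms):
    """Split a block of sorted terms into a common prefix and the suffixes."""
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    prefix = os.path.commonprefix(terms)
    return prefix, [term[len(prefix):] for term in terms]

def decode_block(prefix, suffixes):
    """Rebuild the original terms from the front-coded block."""
    return [prefix + suffix for suffix in suffixes]

block = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_code_block(block)
# prefix is "automat", suffixes are ["a", "e", "ic", "ion"]
```

The gain depends on how long the shared prefix is, which is why front coding is applied to blocks of neighbouring terms in sorted order.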

81 How did we do? Collection: 960 MB Source: Information Retrieval book 81

82 How did we do? Collection: 960 MB. Fixed width: 11.2 MB. Source: Information Retrieval book

83 How did we do? Collection: 960 MB. Fixed width: 11.2 MB. Unique string and pointers: 7.6 MB. Source: Information Retrieval book

84 How did we do? Collection: 960 MB. Fixed width: 11.2 MB. Unique string and pointers: 7.6 MB. Blocking (k=4): 7.1 MB. Source: Information Retrieval book

85 How did we do? Collection: 960 MB. Fixed width: 11.2 MB. Unique string and pointers: 7.6 MB. Blocking (k=4): 7.1 MB. Blocking and front coding: 5.9 MB. Source: Information Retrieval book

86 Postings file compression 86

87 Standard inverted index ETH Zürich computer data CPU information retrieval

88 Standard inverted index ETH Zürich computer data CPU information retrieval We compressed this... 88

89 Standard inverted index ETH Zürich computer data CPU information retrieval Now, we want to compress this. 89

90 Standard inverted index In other words, we want to compress lists of integers 90

91 Standard storage bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes 91

92 Standard storage bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes (4 bytes = 32 bits) 92

93 Standard storage bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes (4 bytes = 32 bits) Numbers between 0 and 4,294,967,295

94 Encoding gaps bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes (4 bytes = 32 bits) Can we encode with less space? 94

95 Encoding gaps

96 Encoding gaps

97 Encoding gaps These are small gaps!

98 Encoding gaps

99 Encoding gaps

100 Encoding gaps

101 Encoding gaps But this only works for frequent terms! 101

102 Encoding gaps Can we have variable gap size? 102
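
Gap encoding replaces each document ID in a sorted postings list by its distance to the previous one. A minimal sketch follows; the function names and the sample document IDs are illustrative assumptions, not values from the slides.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Turn a sorted list of document IDs into first ID plus gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the document IDs as running sums of the gaps."""
    return list(accumulate(gaps))

postings = [283154, 283159, 283202]
gaps = to_gaps(postings)  # [283154, 5, 43]
```

For frequent terms the gaps are small, so they need far fewer bits than the document IDs themselves. For rare terms the gaps stay large, which motivates a variable-length code.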

103 Variable byte encoding 103

104 Fixed-length encoding. [Figure: postings stored in 4-byte slots.] We know exactly where the boundaries are.

105 Fixed-length encoding. [Figure: postings stored in 4-byte slots.] We know exactly where the boundaries are.

106 Fixed-length encoding. [Figure: postings stored in 4-byte slots.] We know exactly where the boundaries are. 32 bits

107 Fixed-length encoding. [Figure: postings stored in 4-byte slots.] We know exactly where the boundaries are. Stop! 32 bits

108 Fixed-length encoding. [Figure: postings stored in 4-byte slots.] We know exactly where the boundaries are. Stop! 32 bits, 32 bits

109 Variable-length encodings. [Figure: postings of varying sizes, such as 2, 4, 3, 5 and 4 bytes.] We do not know a priori where the boundaries are.

110 Variable-length encodings. [Figure: postings of varying sizes, such as 2, 4, 3, 5 and 4 bytes.] We do not know a priori where the boundaries are. ? x bits

111 Prefix codes x bits we can deduce from the bits when to stop

112 Prefix codes: phone numbers Example Internally

113 Prefix codes: phone numbers Example Internally

114 Prefix codes: phone numbers Example Internally

115 Prefix codes: UTF-8. [Table columns: Character, Codepoint, Codepoint in binary, UTF-8 (variable length).]

116-118 Prefix codes: UTF-8. P, U+0050: at most 7 bits.

119-121 Prefix codes: UTF-8. π, U+03C0: at most 11 bits.

122-124 Prefix codes: UTF-8. €, U+20AC: at most 16 bits.

125 Variable byte encoding (here with 8 bit packets)

126 Variable byte encoding (here with 8 bit packets) continuation bit encoding on n-1 bits (here 7) 126

127 Case 1: at most 7 bits required. Example: 4 (binary 100)

128 Case 1: at most 7 bits required. Example: 4 (binary 100)

129 Case 1: at most 7 bits required. Example: 4 (binary 100)

130 Case 1: at most 7 bits required. Example: 4 (binary 100). 0 = ends here

131 Case 2: between 8 and 14 bits required. Example: 270

132 Case 2: between 8 and 14 bits required. Example: 270

133 Case 2: between 8 and 14 bits required. Example: 270

134 Case 2: between 8 and 14 bits required. Example: 270. 1 = doesn't end here, 0 = ends here

135 And so on and so forth

136 And so on and so forth

137 And so on and so forth. 1 = doesn't end here, 0 = ends here

138 Variable byte encoding: example with 4 bit packets decimal 0 binary 0 variable byte encoding

139 Variable byte encoding: example with 4 bit packets decimal binary variable byte encoding

140 Variable byte encoding: example with 4 bit packets decimal binary variable byte encoding

141 Variable byte encoding: example with 4 bit packets decimal binary variable byte encoding

142 Variable byte encoding: example with 4 bit packets fits on 3 bits fits on 6 bits decimal binary variable byte encoding

143 Variable byte encoding: example with 4 bit packets fits on 3 bits fits on 6 bits decimal binary variable byte encoding

144 Variable byte encoding: example with 4 bit packets fits on 3 bits fits on 6 bits decimal binary variable byte encoding % less space

145 Variable byte encoding is a parameterized encoding. Packet sizes: n=2 (xx), n=4 (xxxx), n=8 (xxxxxxxx), n=16 (xxxxxxxxxxxxxxxx).
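
The scheme can be written down for any packet size n. The sketch below follows the convention stated on the slides (continuation bit 1 = doesn't end here, 0 = ends here) and assumes the continuation bit comes first in each packet, followed by n-1 payload bits. Bits are kept as a string of '0' and '1' characters for readability; a real implementation would pack them into bytes. The function names are hypothetical.

```python
def vb_encode(number, packet_bits=8):
    """Encode a non-negative integer with packets of packet_bits bits."""
    if number < 0 or packet_bits < 2:
        raise ValueError("need number >= 0 and packet_bits >= 2")
    payload_bits = packet_bits - 1
    mask = (1 << payload_bits) - 1
    # Split the number into payload-sized chunks, least significant first.
    chunks = [number & mask]
    number >>= payload_bits
    while number:
        chunks.append(number & mask)
        number >>= payload_bits
    chunks.reverse()
    packets = []
    for i, chunk in enumerate(chunks):
        flag = "0" if i == len(chunks) - 1 else "1"  # 0 marks the last packet
        packets.append(flag + format(chunk, "0{}b".format(payload_bits)))
    return "".join(packets)

def vb_decode(bits, packet_bits=8):
    """Decode a concatenation of variable byte codes into a list of integers."""
    numbers = []
    current = 0
    for start in range(0, len(bits), packet_bits):
        packet = bits[start:start + packet_bits]
        current = (current << (packet_bits - 1)) | int(packet[1:], 2)
        if packet[0] == "0":  # the number ends with this packet
            numbers.append(current)
            current = 0
    return numbers

stream = vb_encode(4) + vb_encode(270)
```

Under these assumptions 4 fits into one 8-bit packet and 270 needs two, and the decoder finds the boundaries from the continuation bits alone.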

146 Example (here, 8 bits)

147 Example (here, 8 bits)

148 Example (here, 8 bits)

149 Example (here, 8 bits)

150 Example (here, 8 bits)

151 Example (here, 8 bits) ,

152 No free lunch 152

153 Compromise for variable byte encoding. Big packets: little compression, little overhead.

154 Compromise for variable byte encoding. Big packets: little compression, little overhead. Small packets: much compression, lots of bits to manipulate.

155 Can we compress even more? 155

156 Can we compress even more? bitwise? 156

157 Gamma encoding 157

158 Peter Elias

159 Unary code

160 Unary code ones 160

161 Unary code and a zero to mark the stop 161

162 First integers in unary code integer length (unary)

163 Example (here, 8 bits)

164 Example (here, 8 bits)

165 Example (here, 8 bits)

166 Example (here, 8 bits)

167 Gamma encoding: example

168 Gamma encoding: example 19 binary

169 Gamma encoding: example 19 binary

170 Gamma encoding: example 19 binary

171 Gamma encoding: example 19 binary Length in unary

172 Gamma encoding: example 19 binary Length in unary

173-181 Gamma encoding on the first integers. [Table columns, revealed one at a time: decimal, binary, binary without leading 1, length, length (unary), gamma code.]
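
The table columns above map directly onto code: take the binary representation, drop the leading 1, write the length of what remains in unary (ones followed by a zero), and append the remaining bits. The sketch below uses strings of '0' and '1' for readability; the function names are hypothetical.

```python
def unary(n):
    """Unary code: n ones and a zero to mark the stop."""
    return "1" * n + "0"

def gamma_encode(number):
    """Elias gamma code of a positive integer as a bit string."""
    if number < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    offset = bin(number)[3:]  # binary without the '0b' marker and the leading 1
    return unary(len(offset)) + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    numbers = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the terminating zero
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # put the leading 1 back
    return numbers

code = gamma_encode(19)  # 19 is 10011 in binary
```

For 19 the offset is 0011, its length 4 is 11110 in unary, and the gamma code is 111100011. Because the unary part announces the length, codes can be concatenated without separators.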

182 Gamma encoding properties Variable length encoding 182

183 Gamma encoding properties Variable length encoding Prefix encoding 183

184 Gamma encoding properties Variable length encoding Prefix encoding Universal encoding 184

185 Shannon Entropy: $H(X) = E[I(X)]$

186 Shannon Entropy: $H(X) = E[I(X)]$. "Amount of information" = number of bits

187 Shannon Entropy: $H(X) = E[I(X)]$. "Amount of information" = number of bits. [Plot: $I(p)$ against $p$, for $p$ between 0 and 1.]

188 Shannon Entropy: $H(X) = E[-\log_2 p_X(X)]$. "Amount of information" = number of bits. [Plot: $I(p)$ against $p$, for $p$ between 0 and 1.]

189 Shannon Entropy: $H(X) = E[I(X)] = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x)$

190 Shannon Entropy: $H(X) = E[I(X)] = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x)$. Extreme cases: $H(p) = 0$ and $H(p) = \log n$

191 Expected length of gamma encoding: $E[L_\gamma(X)] \le 3H(X) = 3E[I(X)]$. A constant factor from optimal!

192 Expected length of gamma encoding: $E[L_\gamma(X)] \le 2H(X) + 1 = 2E[I(X)] + 1$. A constant factor from optimal!
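
The entropy formula and its two extreme cases are easy to verify numerically. The following sketch assumes a finite distribution given as a list of probabilities; `shannon_entropy` is a hypothetical helper name.

```python
import math

def shannon_entropy(probabilities):
    """H(X) = -sum p(x) log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

certain = shannon_entropy([1.0])        # a single certain outcome: 0 bits
uniform = shannon_entropy([0.25] * 4)   # uniform over n = 4 outcomes: log2(4) = 2 bits
```

The certain outcome gives the minimum $H = 0$, and the uniform distribution over $n$ outcomes gives the maximum $H = \log_2 n$.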

193 How much can we compress the inverted index? 193

194 Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$

195 Zipf's law (renormalized): $\text{Renormalized frequency} = \frac{c}{\text{Rank}}$

196 Zipf's law (renormalized): $\text{Renormalized frequency} = \frac{c}{\text{Rank}}$ with $\sum_{i=1}^{M} \frac{c}{i} = 1$

197 Zipf's law: $\text{Number of occurrences per document} = \text{Document length} \cdot \frac{c}{\text{Rank}}$

198 Zipf's law: $\text{Number of occurrences per document} = \frac{Lc}{\text{Rank}}$

199 Zipf's law: $\text{Number of postings} = \text{Number of documents} \cdot \text{Number of occurrences per document}$

200 Zipf's law: $\text{Number of postings} = \frac{NLc}{\text{Rank}}$

201 Zipf's law: $\text{Number of postings} = \frac{NLc}{\text{Rank}}$. Blocks with $Lc$ terms.

202 Zipf's law: $\text{Number of postings} = \frac{NLc}{\text{Rank}}$. Block boundaries at $\text{Rank} = Lc$, $\text{Rank} = 2Lc$, $\text{Rank} = 3Lc$.

203 Zipf's law: approximations. Block boundaries at $\text{Rank} = Lc$, $\text{Rank} = 2Lc$, $\text{Rank} = 3Lc$.

204 Zipf's law: $N$ postings at $\text{Rank} = Lc$.

205 Zipf's law: $N$ postings at $\text{Rank} = Lc$, $\frac{N}{2}$ postings at $\text{Rank} = 2Lc$.

206 Zipf's law: $N$ postings at $\text{Rank} = Lc$, $\frac{N}{2}$ postings at $\text{Rank} = 2Lc$, $\frac{N}{3}$ postings at $\text{Rank} = 3Lc$.

207 Zipf's law: approximations. $N$ postings at $\text{Rank} = Lc$, $\frac{N}{2}$ postings at $\text{Rank} = 2Lc$, $\frac{N}{3}$ postings at $\text{Rank} = 3Lc$.

208 Zipf's law: $\frac{N}{j}$ postings

209 Zipf's law: $\frac{N}{j}$ postings, gap $= j$

210 Zipf's law: $N$ postings with gap $= 1$ at $\text{Rank} = Lc$; $\frac{N}{2}$ postings with gap $= 2$ at $\text{Rank} = 2Lc$; $\frac{N}{3}$ postings with gap $= 3$ at $\text{Rank} = 3Lc$.

211 Zipf's law: $\frac{N}{j}$ postings with gap $= j$ for ranks between $(j-1)Lc$ and $jLc$.

212 Zipf's law: $\frac{N}{j}$ postings with gap $= j$ for ranks between $(j-1)Lc$ and $jLc$. Number of bits per term: $\frac{N}{j} (2 \log_2 j + 1)$

213 Zipf's law: $\frac{N}{j}$ postings with gap $= j$ for ranks between $(j-1)Lc$ and $jLc$. Number of bits per term block: $\frac{NLc}{j} (2 \log_2 j + 1)$

214 Zipf's law: $\#\text{bits} = \sum_{j=1}^{M/(Lc)} \frac{NLc}{j} (2 \log_2 j + 1)$

215 Zipf's law: $\#\text{bits} \approx \sum_{j=1}^{M/(Lc)} \frac{2NLc \log_2 j}{j}$
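
The sum over term blocks can be evaluated directly. The sketch below is an illustration of the estimate on the slides, using the symbols N (documents), L (document length), M (terms) and c (Zipf constant); the function name and the sample parameter values are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def zipf_gamma_bits(num_docs, doc_length, num_terms, c):
    """Estimated size in bits of the gamma-coded postings under Zipf's law.

    Block j holds L*c terms, each with N/j postings and gaps of size j,
    and each gap costs about 2*log2(j) + 1 bits.
    """
    block_size = doc_length * c                # L*c terms per block
    num_blocks = int(num_terms / block_size)   # M / (L*c) blocks
    total = 0.0
    for j in range(1, num_blocks + 1):
        total += num_docs * block_size / j * (2 * math.log2(j) + 1)
    return total

# Illustrative parameters only: L*c = 20 terms per block, two blocks.
bits = zipf_gamma_bits(num_docs=1000, doc_length=160, num_terms=40, c=0.125)
```

With these toy values, block 1 contributes 1000 · 20 · 1 = 20,000 bits and block 2 contributes 1000 · 20 / 2 · 3 = 30,000 bits, for a total of 50,000 bits.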

216 How did we do? Collection: 960 MB 216

217 How did we do? Collection: 960 MB. Uncompressed on 32 bits: 400 MB.

218 How did we do? Collection: 960 MB. Uncompressed on 32 bits: 400 MB. Uncompressed on 20 bits: 250 MB.

219 How did we do? Collection: 960 MB. Uncompressed on 32 bits: 400 MB. Uncompressed on 20 bits: 250 MB. Variable byte encoding (gaps): 116 MB.

220 How did we do? Collection: 960 MB. Uncompressed on 32 bits: 400 MB. Uncompressed on 20 bits: 250 MB. Variable byte encoding (gaps): 116 MB. Elias γ encoding (gaps): 101 MB.

221 Credits This week: Chapter 5 221


More information

GUJARAT TECHNOLOGICAL UNIVERSITY

GUJARAT TECHNOLOGICAL UNIVERSITY GUJARAT TECHNOLOGICAL UNIVERSITY INFORMATION TECHNOLOGY DATA COMPRESSION AND DATA RETRIVAL SUBJECT CODE: 2161603 B.E. 6 th SEMESTER Type of course: Core Prerequisite: None Rationale: Data compression refers

More information

Succinct Data Structures: Theory and Practice

Succinct Data Structures: Theory and Practice Succinct Data Structures: Theory and Practice March 16, 2012 Succinct Data Structures: Theory and Practice 1/15 Contents 1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics Succinct

More information

CISC689/ Information Retrieval Midterm Exam

CISC689/ Information Retrieval Midterm Exam CISC689/489-010 Information Retrieval Midterm Exam You have 2 hours to complete the following four questions. You may use notes and slides. You can use a calculator, but nothing that connects to the internet

More information

Exam IST 441 Spring 2014

Exam IST 441 Spring 2014 Exam IST 441 Spring 2014 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.

More information

Compressing Integers for Fast File Access

Compressing Integers for Fast File Access Compressing Integers for Fast File Access Hugh E. Williams Justin Zobel Benjamin Tripp COSI 175a: Data Compression October 23, 2006 Introduction Many data processing applications depend on access to integer

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 6: Information Retrieval I Aidan Hogan aidhog@gmail.com Postponing MANAGING TEXT DATA Information Overload If we didn t have search Contains all

More information

MG4J: Managing Gigabytes for Java. MG4J - intro 1

MG4J: Managing Gigabytes for Java. MG4J - intro 1 MG4J: Managing Gigabytes for Java MG4J - intro 1 Managing Gigabytes for Java Schedule: 1. Introduction to MG4J framework. 2. Exercitation: try to set up a search engine on a particular collection of documents.

More information

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique

More information

15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2

15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 Name Section Directions: Answer each question neatly in the space provided. Please read each question carefully. You have 50 minutes for this exam. No electronic

More information

Natural Language Processing Basics. Yingyu Liang University of Wisconsin-Madison

Natural Language Processing Basics. Yingyu Liang University of Wisconsin-Madison Natural Language Processing Basics Yingyu Liang University of Wisconsin-Madison Natural language Processing (NLP) The processing of the human languages by computers One of the oldest AI tasks One of the

More information

CS347. Lecture 2 April 9, Prabhakar Raghavan

CS347. Lecture 2 April 9, Prabhakar Raghavan CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards This time: Spell correction Soundex Index construction Index

More information

M1 Computers and Data

M1 Computers and Data M1 Computers and Data Module Outline Architecture vs. Organization. Computer system and its submodules. Concept of frequency. Processor performance equation. Representation of information characters, signed

More information

Query Answering Using Inverted Indexes

Query Answering Using Inverted Indexes Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 9 Indexing and Searching with Gonzalo Navarro Introduction Inverted Indexes Signature Files Suffix Trees and Suffix Arrays Sequential Searching Multi-dimensional Indexing

More information

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Indexing (2) Instructor: Walid Magdy 03-Oct-2018 Lecture Objectives Learn more about indexing: Structured documents Extent index Index compression Data structure

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Indexing (2) Instructor: Walid Magdy 10-Oct-2017 Lecture Objectives Learn more about indexing: Structured documents Extent index Index compression Data structure

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Big Data for Engineers Spring Resource Management

Big Data for Engineers Spring Resource Management Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models

More information

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures

More information

MILC: Inverted List Compression in Memory

MILC: Inverted List Compression in Memory MILC: Inverted List Compression in Memory Yorrick Müller Garching, 3rd December 2018 Yorrick Müller MILC: Inverted List Compression In Memory 1 Introduction Inverted Lists Inverted list := Series of sorted

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Lossless Compression Algorithms

Lossless Compression Algorithms Multimedia Data Compression Part I Chapter 7 Lossless Compression Algorithms 1 Chapter 7 Lossless Compression Algorithms 1. Introduction 2. Basics of Information Theory 3. Lossless Compression Algorithms

More information

Information Retrieval

Information Retrieval Information Retrieval Data Processing and Storage Ilya Markov i.markov@uva.nl University of Amsterdam Ilya Markov i.markov@uva.nl Information Retrieval 1 Course overview Offline Data Acquisition Data Processing

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The

More information

Midterm Exam Search Engines ( / ) October 20, 2015

Midterm Exam Search Engines ( / ) October 20, 2015 Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Hamid Rastegari Lecture 4: Index Construction Plan Last lecture: Dictionary data structures

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary

More information

Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term.

Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term. Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term. Question 1: (4 points) Shown below is a portion of the positional index in the format term: doc1: position1,position2

More information

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval Nazli Goharian, Ankit Jain, Qian Sun Information Retrieval Laboratory Illinois Institute of Technology Chicago, Illinois {goharian,ajain,qian@ir.iit.edu}

More information

Architecture and Implementation of Database Systems (Summer 2018)

Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner Architecture & Implementation of DBMS Summer 2018 1 Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2018 Jens

More information

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization

More information

Ghislain Fourny. Big Data 5. Wide column stores

Ghislain Fourny. Big Data 5. Wide column stores Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces

More information

Boolean Queries. Keywords combined with Boolean operators:

Boolean Queries. Keywords combined with Boolean operators: Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Indexing April 27, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Recap: Inverted Indexes

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

COMPUTING SUBJECT KNOWLEDGE AUDIT

COMPUTING SUBJECT KNOWLEDGE AUDIT COMPUTING SUBJECT KNOWLEDGE AUDIT Use this needs analysis to help self-assess and track your computing subject knowledge. Topic Area 1 Computational thinking Define, explain and use these concepts with

More information

Data Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression

Data Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression An overview of Compression Multimedia Systems and Applications Data Compression Compression becomes necessary in multimedia because it requires large amounts of storage space and bandwidth Types of Compression

More information

Lecture 3: Phrasal queries and wildcards

Lecture 3: Phrasal queries and wildcards Lecture 3: Phrasal queries and wildcards Trevor Cohn (tcohn@unimelb.edu.au) COMP90042, 2015, Semester 1 What we ll learn today Building on the boolean index and query mechanism to support multi-word queries

More information

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints: CS231 Algorithms Handout # 31 Prof. Lyn Turbak November 20, 2001 Wellesley College Compression The Big Picture We want to be able to store and retrieve data, as well as communicate it with others. In general,

More information

Variable Length Integers for Search

Variable Length Integers for Search 7:57:57 AM Variable Length Integers for Search Past, Present and Future Ryan Ernst A9.com 7:57:59 AM Overview About me Search and inverted indices Traditional encoding (Vbyte) Modern encodings Future work

More information

CS-245 Database System Principles

CS-245 Database System Principles CS-245 Database System Principles Midterm Exam Summer 2001 SOLUIONS his exam is open book and notes. here are a total of 110 points. You have 110 minutes to complete it. Print your name: he Honor Code

More information

2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response

2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response CMSC 476/676 Review 1. Week 1 Overview of Information Retrieval a. Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.

More information

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013 Abdullah-Al Mamun CSE 5095 Yufeng Wu Spring 2013 Introduction Data compression is the art of reducing the number of bits needed to store or transmit data Compression is closely related to decompression

More information

Programming II (CS300)

Programming II (CS300) 1 Programming II (CS300) Chapter 12: Sorting Algorithms MOUNA KACEM mouna@cs.wisc.edu Spring 2018 Outline 2 Last week Implementation of the three tree depth-traversal algorithms Implementation of the BinarySearchTree

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2015 Quiz I There are 12 questions and 13 pages in this quiz booklet. To receive

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval I. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval I. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2016 Lecture 7: Information Retrieval I Aidan Hogan aidhog@gmail.com MANAGING TEXT DATA Information Overload If we didn t have search Contains all books with

More information