Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management Full- Text Indexing
Contents } Introduction } Inverted Indices } Construction } Searching 2 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Sequen$al or online searching } Involves finding the occurrences of a pafern in a text when the text is not preprocessed } Is appropriate when the text is small } Is the only choice if the text collec$on is very vola$le (i.e. undergoes modifica$ons very frequently), or the index space overhead cannot be afforded 3 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Indexed searching Index } data structure over the text to speed up the search } is appropriate when the text collec$on is large and semi- sta$c } Semi- sta$c collec$on: is updated at reasonably regular intervals but is not deemed to support thousands of inser$on of single words per second. Indexing techniques } inverted indices, suffix arrays, and signature files } consider } search cost } space overhead } cost of building and upda$ng indexing structures 4 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Indexing techniques Inverted indices } Word oriented mechanism for indexing a text collec$on } Composed of vocabulary and occurrences } are currently the best choice for most applica$ons Suffix arrays } are faster for phrase searches and other less common queries } are harder to build and maintain Signature files } Word oriented index structures based on hashing } were popular in the 980s 5 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Nota$ons and Background } Nota$ons n: the size of the text database m: pafern length M: amount of main memory available } Background } sorted arrays } binary search tree } B- tree } hash table } trie 6 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Trie } TRIE or PREFIX TREE } An in- memory mul$way tree storing a set of strings } } } Strings are stored in the leaves Every edge of the trie is labelled with a lefer All the descendants of a node have a common prefix } the sequence of lefers from root to the node 6 9 7 9 24 28 33 40 46 50 55 60 This is a text. A text has many words. Words are made from lefers. Strings to be stored and the l lefers:60 corresponding star$ng posi$ons: d made:50 lefers: 60 m a made:50 t n many: 28 text:,9 many:28 w text:, 9 words:33,40 words: 33,40 7 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Trie (cont.) Construc<on } The root of the trie uses the first character } The children of the root uses the second character, and so on } If the remaining subtrie contains only one string, that string s iden$ty is stored in a leaf node Access } Start from the root } Follow the path given by the character sequence of the pafern } Stop when a leaf is found or no character matches with the current one } the access cost is O(m) 8 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Contents } Introduction } Inverted Indices } Construction } Searching 9 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Sec.. Unstructured data in 680 } Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? } One could grep all of Shakespeare s plays for Brutus and Caesar, then strip out lines containing Calpurnia? } Why is that not the answer? } Slow (for large corpora) } NOT Calpurnia is non- trivial } Other opera$ons (e.g., find the word Romans near countrymen) not feasible 0
Sec.. Term- document incidence Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 0 0 0 Brutus 0 0 0 Caesar 0 Calpurnia 0 0 0 0 0 Cleopatra 0 0 0 0 0 mercy 0 worser 0 0 Brutus AND Caesar BUT NOT Calpurnia if play contains word, 0 otherwise
Sec.. Incidence vectors } So we have a 0/ vector for each term. } To answer query: } take the vectors for } Brutus: 000 } Caesar: 0 } Calpurnia (complemented):0 } bitwise AND } 000 AND 0 AND 0 = 0000 } Select the documents corresponding to 2
Sec.. Answers to query } Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. } Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. 3
Bigger collec$on } Consider N = 0 6 documents, each with about 000 tokens total of 0 9 tokens } On average 6 bytes per token, including spaces and punctua$on size of document collec$on is about 6 0 9 = 6 GB } Assume there are M = 500,000 dis$nct terms in the collec$on (Note that we are making a term/token dis$nc$on.) 4 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Can t build the incidence matrix } M = 500,000 0 6 = half a trillion 0s and s. } But the matrix has no more than one billion s. } Matrix is extremely sparse. } What is a befer representa$ons? } We only record the s. 5 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Sec..2 Inverted index } For each term t, we must store a list of all documents that contain t. } Iden$fy each by a docid, a document serial number } Can we use fixed- size arrays for this? Brutus 2 4 3 45 7374 Caesar 2 4 5 6 6 57 32 Calpurnia 2 3 54 0 What happens if the word Caesar is added to document 4? 6
Defini$on of Inverted index } A word- oriented mechanism for indexing a text collec$on in order to speed up the searching task } Two elements } Vocabulary } The set of all different words in the text } Occurrences (Pos<ng lists) } For each word a list of loca$ons where term occurs Document- based: A list of documents with the corresponding term frequency Word- based: A list of documents with the corresponding word posi$ons 7 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Document- based inverted index Index terms computer database df 3 2 D j, tf j D 7, 4 D, 3 science 4 D 2, 4 system D 5, 2 Vocabulary Pos<ng lists 8 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Word- based inverted index Index terms computer database df 3 2 D j, wp,, wp n D 7, 50, 90, 50, 800 D,,00,634 science system 4 D 2, D 5, Vocabulary Pos<ng lists } The posi$on of the term in the document, wp i, facilitates proximity searching 9 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Space requirements } The space required for the vocabulary is rather small } HEAPS LAW: the vocabulary grows as O(n b ) where b [0,] (usually [0.4,0.6]) } The occurrences demand much more space Occurrence space (in rela$on to original collec$on size) Small collec$on ( Mb) No Stop words All text Medium collec$on (200 Mb) No Stop words All text Large collec$on (2 Gb) No Stop words All text Addressing words 45% 73% 36% 64% 35% 63% Addressing documents 9% 26% 8% 32% 26% 47% 20 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Contents } Introduction } Inverted Indices } Construction } Searching 2 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Sec..2 Inverted index construc$on Documents to be indexed Friends, Romans, countrymen. Tokenizer Friends Romans Countrymen Linguis$c modules Modified tokens Inverted index Indexer friend roman countryman friend roman countryman 2 4 2 3 6
Construc$on based on a trie } All the vocabulary known up to now is kept in a trie structure Steps. Read each word of the text 2. Search the word in the trie 3. If word is not found in the trie, it is added to the trie with an empty list of occurrences 4. If word is in the trie, the new posi$on is added to the end of its list of occurrences } Once the text is exhausted, the trie is wrifen to disk together with the list of occurences Complexity O() opera$ons per text character (step 2) O() inser$on of the posi$on in the list of occurrences (steps 3 & 4) O(n) for the overall process 23 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Splipng the index into two files } Splipng the index into two files allows the vocabulary to be kept in memory at search $me in many cases } Pos$ng file } The lists of occurrences are stored con$guously } Vocabulary file } The vocabulary is stored and, for each word, the number of documents associated with it and a pointer to its list in the pos$ng file is also included 24 GAvI - Full- Text Informa$on Management: Full- Text Indexing
An example Vocabulary file implemented through a sorted array Term Start n boundary case computer database deliver document fan play position science System 0 3 5 6 7 0 3 4 5 7 2 2 3 3 2 0 5 0 5 Pos$ng file 002 3 002 00 003 2 00 2 003 002 004 The pos<ng list of document starts at posi<on 7 and contains 3 entries. 004 00 004 2 003 00 002 002 2 003 2 003 00 document is contained in documents 00 002 003 with frequency 25 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Structures for vocabulary file To accelerate vocabulary searches: } Sorted array } The vocabulary is stored in lexicographical order } Searched using a standard binary search } Complexity O(log 2 vocabulary ) } Disadvantage: upda$ng is expensive } B + - tree } The vocabulary is stored in a B + - tree } Disadvantage: B + - tree uses more space than sorted array } Trie } Hash 26 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Construc$on using PARTIAL INDICES } For large texts where the index does not fit in main memory Construc<on step. The algorithm already described is used un$l the main memory is exhausted. 2. When no more memory is available, the par$al index obtained up to now is wrifen to disk. 3. Erase from main memory. 4. Con$nue with the rest of the text. 5. A number of par$al indices on disk are merged in a hierarchical fashion. 27 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Merging par$al indices Merging step } Given two par$al indexes I and I 2. Merge the sorted vocabularies Complexity O( I + I 2 )) 2. Whenever the same word appears in both indices, merge both lists of occurrences } } By construc$on, the occurrences of the smaller- numbered index are before those of the larger- numbered index We can perform list concatena$on 28 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Merging par$al indices I computer database system D, 20,50,90 D, 30 D, 00 I 2 CAD D 2, 20 database D 2, 0 system 2 D, 20 D 2, 50 I 2 CAD D 2, 20 computer D, 20,50,90 database system 2 2 D, 30 D 2, 0 D, 00 D, 20 D 2, 50 29 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Merging par$al indices Binary fashion } More than two indices can be merged } Complexity: } } } n/m: Number of par$al indexes O(n) merging cost at each level O(n log(n/m)) the overall cost } To reduce build- $me space requirements } It is possible to perform the merging in- place I-..4 3 I-..8 7 I-5..8 I-..2 I-3..4 I-5..6 I-7..8 2 4 5 I- I-2 I-3 I-4 I-5 I-6 I-7 I-8 Level (initial dumps) 6 Level 4 (final index) Level 3 Level 2 30 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Popular indexing systems Informa$on platorm for enterprise Open seman$c search Enterprise search platorm Web crawler Indexing and search technology 3 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Addi$onal material } Inside google data center hfps://www.youtube.com/watch?v=pbx7rgqegg8 } Inverted index compression } Stanford IR book chapter hfp://nlp.stanford.edu/ir- book/html/htmledi$on/index- compression-.html } MaFeo Catena, Craig Macdonald Iadh Ounis On Inverted Index Compression for Search Engine ECIR 204 (available online) } Inverted index distribu$on } Apache SolrCloud hfps://cwiki.apache.org/confluence/display/solr/how +SolrCloud+Works 32 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Contents } Introduction } Inverted Indices } Construction } Searching 33 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Three General steps. Vocabulary search } } The paferns and words present in the query are isolated and searched in the vocabulary } Phrases and proximity queries are split into single words The cost depends on the used structure } Trie or Hash: O(m) thus independent on the text size } B- tree or Sorted Array: O(log n) 2. Retrieval of occurrences } The lists of the occurrences of all the words found are retrieved 3. Manipula$on of occurrences } The occurrences are processed to solve basic queries, phrases, proximity, or Boolean opera$ons 34 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Single- word Retrieval } Single- word queries } Can be searched using any suitable data structure to speed up the search in the vocabulary file } Prefix and range queries } Can be solved with binary search, tries, or B- trees, but not with hashing. 35 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Boolean Retrieval } Given a boolean query } Parse the query into its syntax tree } Leaves: basic queries database AND } Internal nodes: operators Q: database OR science AND case? } Solve the leaves of the query syntax tree using science the appropriate algorithm } Work the relevant documents by composi$on operators } OR: Recursively retrieve e and e 2 and take union of results } AND: Recursively retrieve e and e 2 and take intersec$on of results } BUT: Recursively retrieve e and e 2 and take set difference of results } Op$miza$on } } Algebraic op$miza$on } E.g. a OR (a AND b) = a Keep the set sorted in order to proceed sequen$ally on both lists when intersec$on, union, etc. are required OR case 36 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Intersec$ng two pos$ngs lists (a merge algorithm) 37
Boolean Retrieval Term Start n boundary case computer database deliver document fan play position science System 0 3 5 6 7 0 3 4 5 7 2 2 3 3 2 0 5 0 5 002 3 002 00 003 2 00 2 003 002 004 004 00 004 2 003 00 002 002 2 Q: database OR science AND case? database OR science AND case 003 2 003 00 database OR ( {003,004} {00,004} ) {002} {004} {002,004} 38 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Phrasal Retrieval } BeFer a word- based inverted file Retrieve documents and posi$ons for each individual word 2 intersect documents 3 check for ordered con$guity of keyword posi$ons } Best to start con$guity check with the least common word in the phrase } Simple and effec$ve op$miza$on: Process in order of increasing frequency 39 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Searching: Phrasal Retrieval. Find set of documents D in which all keywords (k k m ) in phrase occur (using AND query processing) 2. Ini$alize empty set, R, of retrieved documents 3. For each document, d, in D:. Get array, P i,of posi$ons of occurrences for each k i in d 2. Find shortest array P s of the P i s 3. For each posi$on p of keyword k s in P s For each keyword k i except k s. Use binary search to find a posi$on (p + i s) in the array P i 4. If correct posi$on for every keyword found, add d to R 4. Return R 40 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Searching: Phrasal Retrieval 2 3 4 5 6 7 This is a text. A text has many words. Words are made from lefers. letters made 6 3 many text words Step 3.3: take 3 7, 2 4, 5 Q: many words? 2 P (many) 3 P P 2 P 2 (words) 4, 5 Step 3.3.: Use binary search to find the posi$on (p + i s) = (3 + 2 - ) in the array P 2 Step 4: add the document to R 4 GAvI - Full- Text Informa$on Management: Full- Text Indexing
Searching: Proximity Retrieval } Use approach similar to phrasal search to find documents in which all keywords are found in a context that sa$sfies the proximity constraints } During binary search for posi$ons of remaining keywords, find closest posi$on of k i to p and check that it is within maximum allowed distance 42 GAvI - Full- Text Informa$on Management: Full- Text Indexing