Information Retrieval. Lecture 2 - Building an index

1 Information Retrieval Lecture 2 - Building an index Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester / 40

2 Overview
- Introduction
- Boolean model revisited
- Building the vocabulary
- Building an index

3 Introduction
Last time:
- terminology (inverted indexes, dictionary, posting lists, query, etc.)
- conceptual model for IR (information need, query, retrieval of relevant documents)
- the boolean model (building an index and applying queries)
Today:
- the boolean model revisited (processing phrase queries)
- how to build a dictionary (i.e. extracting the keywords from the data, and the corresponding postings)?

4 Boolean model revisited: Part 1. Boolean model (continued)

5 Boolean model revisited: Recall - the boolean model
- an inverted index associates keywords with posting lists
- the posting lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.)
- boolean queries are processed by merging posting lists in order to find the documents satisfying the query
- the cost of this list merging is linear in the total number of document ids: O(m + n), where m and n are the lengths of the two lists
- question: how to process phrase queries (i.e. taking the word's context into account)?
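
The linear-time merge recalled above can be sketched as follows. This is a minimal illustration for an AND query over two sorted lists of document ids; the function name and the sample lists are made up for the example.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of document ids in O(m + n) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return answer


# Hypothetical posting lists for two keywords.
list_a = [1, 2, 4, 11, 31, 45, 173, 174]
list_b = [2, 31, 54, 101]
print(intersect(list_a, list_b))  # [2, 31]
```

Each step advances at least one of the two pointers, which is where the O(m + n) bound comes from.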

6 Boolean model revisited: Processing phrase queries (1 / 3)
- queries where the context of the keyword matters, examples: "Seminar für Sprachwissenschaft", "graph theory", "artificial intelligence", ...
- the user wants documents where the whole phrase appears, and not only some parts of it (i.e. "ich studiere Sprachwissenschaft", German for "I study linguistics", is not a match)
- about 10 % of web queries are phrase queries (e.g. song names, institutions, etc.)
- such queries need either more complex dictionary terms, or more complex postings (critical parameter: size of the index)

7 Boolean model revisited: Processing phrase queries (2 / 3)
A) convert phrase queries into conjunctions of biwords
- use key-phrases of length 2, example:
    Phrase: Gottlieb Daimler Stadion
    Dictionary: (a) Gottlieb Daimler, (b) Daimler Stadion
- the dictionary is made of biwords (notion of context)
- NB: the dictionary gets bigger than with single keywords
- may give false positives (all biwords occur in the document, but not together as one phrase)
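
A small sketch of approach A, assuming whitespace tokenization and lower-casing (both simplifications). The helper names and the two sample documents are invented for illustration; document 2 shows how a false positive arises.

```python
from collections import defaultdict


def biwords(text):
    """Return the list of consecutive word pairs in a text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def build_biword_index(docs):
    """Map each biword to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Answer a phrase query as a conjunction of its biwords."""
    sets = [index.get(pair, set()) for pair in biwords(phrase)]
    return set.intersection(*sets) if sets else set()


docs = {
    1: "Gottlieb Daimler Stadion in Stuttgart",
    2: "Gottlieb Daimler built engines near the Daimler Stadion",
}
index = build_biword_index(docs)
print(sorted(phrase_query(index, "Gottlieb Daimler Stadion")))  # [1, 2]
```

Document 2 contains both biwords but never the full phrase, so it is returned although it is not a true match.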

8 Boolean model revisited: Processing phrase queries (3 / 3)
B) store positions in the inverted indexes, example:
    termid ::= doc1: position1, position2, ...
               doc2: position1, position2, ...
- processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists)
- NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words)

9 Boolean model revisited: Example
Which documents can contain the sentence "to be or not to be", considering the following (incomplete) indexes?
    be ::= 1: 7, 18, 33, 72, 86, 231
           2: 3, 149
           4: 17, 191, 291, 430, 434
           5: 363, 367
    to ::= 2: 1, 17, 74, 222, 551
           4: 8, 16, 190, 429, 433
           7: 13, 23, 191
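
One way to check this example mechanically is sketched below. Since postings for "or" and "not" are not given, the sketch only tests the positions of "to" and "be" (offsets 0, 1, 4 and 5 in the phrase), so it returns candidate documents rather than confirmed matches. The function name and data layout are assumptions made for this illustration.

```python
# Positional postings copied from the example: term -> {doc id: positions}.
index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
}


def phrase_candidates(index, phrase_terms):
    """phrase_terms: (offset, term) pairs for the terms with known postings.

    Returns {doc id: [start positions]} where every listed term occurs
    at its expected offset from the start position.
    """
    first_offset, first_term = phrase_terms[0]
    # Only documents containing all the terms can match.
    docs = set.intersection(*(set(index[t]) for _, t in phrase_terms))
    hits = {}
    for doc in sorted(docs):
        starts = []
        for pos in index[first_term][doc]:
            start = pos - first_offset
            if all(start + off in index[t][doc] for off, t in phrase_terms):
                starts.append(start)
        if starts:
            hits[doc] = starts
    return hits


# "to be or not to be": to=0, be=1, (or=2, not=3 unknown), to=4, be=5.
known = [(0, "to"), (1, "be"), (4, "to"), (5, "be")]
print(phrase_candidates(index, known))  # {4: [429]}
```

Only document 4 survives: "to" at 429 and 433, "be" at 430 and 434.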

10 Boolean model revisited: Size of positional indexes
- positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document id)
- the size of such indexes grows linearly with the size of the documents (one entry per token)
- the size of a positional index depends on the language being indexed and the type of document (books, articles, etc.)
- on average, a positional index is 2-4 times bigger than a non-positional index; it can reach 35 to 50 % of the size of the original text (for English)
- positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al, 2005])

11 Building the vocabulary: Part 2. Building the vocabulary

12 Building the vocabulary
1. Inter-document parsing
2. Intra-document parsing
   - Tokenization
   - Keyword selection
   - Normalization
   - Stemming / Lemmatization

13 Building the vocabulary: Parsing the data (1 / 3)
1. Inter-document parsing: processing the raw data to produce a usable collection of documents
By usable, we mean containing only a sequence of characters
Problems:
- document's format (pdf, doc, html, etc.)
- document's encoding (ISO , UTF-8)
- document's language (automatic recognition)
- document unit (example: mbox vs Maildir)

14 Building the vocabulary: Parsing the data (2 / 3)
2. Intra-document parsing: processing a stream of characters to extract keywords
1st task: tokenization, main difficulties:
- token delimiters (ex: Chinese)
- apostrophes (ex: O'Neill, Finland's capital)
- hyphens (ex: Hewlett-Packard, state-of-the-art)
- segmented compound nouns (ex: Los Angeles)
- unsegmented compound nouns (ex: Sonderforschungsbereich)
- numerical data (dates, IP addresses)
- word order (ex: Arabic wrt nouns and numbers)

15 Building the vocabulary: Parsing the data (3 / 3)
Solutions for tokenization issues:
(a) using a pre-defined dictionary with largest matches and heuristics for unknown words
(b) using learning algorithms trained on hand-segmented words
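
Solution (a) can be illustrated with a greedy longest-match segmenter. This is only a sketch: the tiny dictionary is invented, and the heuristic for unknown material (emit a single character and move on) is the simplest possible assumption, not a recommended one.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match.

    Assumes a non-empty dictionary. Characters that start no known word
    are emitted as one-character tokens (a crude fallback heuristic).
    """
    longest = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink it.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary for the unsegmented compound used as an example.
dictionary = {"sonder", "forschung", "forschungs", "bereich"}
print(max_match("sonderforschungsbereich", dictionary))
# ['sonder', 'forschungs', 'bereich']
```

Because the match is greedy, "forschungs" wins over "forschung"; a different dictionary could just as easily lead the segmenter astray, which is why heuristics or learned models are needed in practice.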

16 Building the vocabulary: Choosing keywords
- Selecting the words that are most likely to appear in a query
- These words characterize the documents they appear in
- What about the other words (i.e. noise words)?
- Words that do not carry informative content wrt queries:
    high frequency terms (the, a, is, etc.)
    words with little semantic content, such as function words (if, of, to, etc.)
- Question: can these words be determined automatically to compute a stop list (i.e. words that will be discarded during indexing)?

20 Building the vocabulary: Stop list
How to build a stop list?
- compute the most frequent terms (either as a preprocessing step or from other corpora with an equivalent distribution)
- be careful, some words like home, life or water are among the 200 most frequently used terms in English [Fox, 1992]
- sort these words by frequency, and apply a semantic filter according to the domain of discourse
- NB: some common terms are needed for phrase queries (song names, relation queries, etc.)
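
The frequency-then-filter procedure could look like the sketch below. The "semantic filter" is reduced here to a hand-written set of words to keep, which is an assumption made for the example; the function name and the toy corpus are likewise invented.

```python
from collections import Counter


def candidate_stop_words(documents, k, keep=frozenset()):
    """Return the k most frequent terms, skipping the words in `keep`.

    `keep` plays the role of the semantic filter: frequent words that
    still carry meaning in the domain of discourse.
    """
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return [word for word in ranked if word not in keep][:k]


corpus = [
    "the water is in the home",
    "the life of the home",
    "water is life",
]
print(candidate_stop_words(corpus, 2, keep={"water", "home", "life"}))
# ['the', 'is']
```

Without the `keep` set, "water" would be ranked as the second stop word in this toy corpus, which is exactly the pitfall mentioned above.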

21 Building the vocabulary: Normalization of tokens
- retrieval needs normalized data
- examples: acronyms (USA vs U.S.A.), dates (22/10/2007 vs 10/22/2007 vs 2007/10/22), diacritics (Tübingen vs Tuebingen), abbreviations (ca. vs circa), typography (university vs University), etc.
- idea: using equivalence classes of terms, ex: { Opel, OPEL, opel } → opel
- alternative: expanding queries to all members of a class (efficiency issues)
- NB: documents and queries have to be processed using the same tokenization process!
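
A minimal sketch of mapping tokens to an equivalence-class representative, assuming three simple rules only: case folding, dropping periods, and stripping diacritics. The function name is made up, and real systems need language-specific rules on top of this.

```python
import unicodedata


def normalize(token):
    """Map a token to a representative of its equivalence class."""
    # Case folding and removal of periods (U.S.A. -> usa).
    token = token.casefold().replace(".", "")
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["Opel", "OPEL", "opel", "U.S.A.", "USA", "Tübingen"]:
    print(word, "->", normalize(word))
```

Note that stripping diacritics turns "Tübingen" into "tubingen", which still differs from "tuebingen"; merging those two forms requires a German-specific rule (ü → ue). The same function must be applied to both documents and queries.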

22 Building the vocabulary: Stemming and lemmatization
- role: reducing inflectional forms to common base forms, example:
    car, cars, car's, cars' → car
    am, are, is → be
- stemming removes suffixes (surface markers) to produce root forms
- lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser)
- illustration of the difficulty: plurals (woman/women, crisis/crisi?), derivational morphology (automatize/automate)
- English: Porter stemming algorithm (University of Cambridge, UK, 1980)

23 Building the vocabulary: Porter stemmer (1 / 3)
- algorithm based on a set of context-sensitive rewriting rules
    martin/porterstemmer/index.html
    martin/porterstemmer/def.txt
- rules are composed of a pattern (left-hand side) and a string (right-hand side), example:
    (.*)sses → \1ss
    (.*[aeiou].*)ed → \1
    (.*[aeiou].*)y → \1i
- rules may be constrained by conditions on the word's measure, example:
    (m > 1) (.*)ement → \1
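
The three unconditioned example rules can be tried out directly as regular expressions. The sketch below applies the first rule whose pattern matches; it is not the Porter stemmer itself, only an illustration of the rule format, and the names `RULES` and `apply_first_rule` are invented.

```python
import re

# (pattern, replacement) pairs taken from the example rules above.
RULES = [
    (re.compile(r"^(.*)sses$"), r"\1ss"),
    (re.compile(r"^(.*[aeiou].*)ed$"), r"\1"),
    (re.compile(r"^(.*[aeiou].*)y$"), r"\1i"),
]


def apply_first_rule(word):
    """Rewrite the word with the first matching rule, if any."""
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word


for word in ["caresses", "plastered", "bled", "happy", "sky"]:
    print(word, "->", apply_first_rule(word))
```

"bled" and "sky" are left untouched because the part before the suffix contains no vowel, which is the context condition encoded by `[aeiou]` in the pattern.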

24 Building the vocabulary: Porter stemmer (2 / 3)
- provided a list of consonants is denoted by C, and a list of vowels by V, any word, or part of a word, has one of the four forms:
    CVCV ... C
    CVCV ... V
    VCVC ... C
    VCVC ... V
- these may all be represented by the single form (m is the measure): [C](VC){m}[V]
- examples:
    m=0  TR, EE, TREE, Y, BY
    m=1  TROUBLE, OATS, TREES, IVY
    m=2  TROUBLES, PRIVATE, OATEN, ORRERY
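
The measure m can be computed by classifying each letter as consonant or vowel and counting the vowel-to-consonant transitions. The sketch assumes Porter's usual convention for the letter y (a vowel when it follows a consonant, a consonant otherwise), which the slide does not spell out; the function name is made up.

```python
def measure(word):
    """Return m in the form [C](VC){m}[V] for the given word."""
    kinds = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            kinds.append("v")
        elif ch == "y" and i > 0 and kinds[-1] == "c":
            kinds.append("v")  # y after a consonant behaves as a vowel
        else:
            kinds.append("c")
    # Each VC block starts where a vowel is followed by a consonant.
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")


for word in ["TR", "EE", "TREE", "Y", "BY", "TROUBLE", "OATS", "TREES",
             "IVY", "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(word, measure(word))
```

Running it reproduces the three groups of examples listed above.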

25 Building the vocabulary: Porter stemmer (3 / 3)
- sequential application of these reduction rules within 5 phases (total of 60 rules):
    Step 1 deals with plurals and past participles
    Steps 2 to 5 deal with English-specific suffixes
- within a phase, in case of ambiguity the longest suffix match is preferred
- once a rule is chosen, its success or failure has no influence on the selection of another rule
- suffix stripping of a vocabulary of 10,000 words:
    Number of words reduced in step 1: 3597
                               step 2: 766
                               step 3: 327
                               step 4: 2424
                               step 5: 1373
    Number of words not reduced:

26 Building the vocabulary: About existing stemmers (1 / 2)
Julie Beth Lovins's stemmer:
    stemming/general/lovins.htm
- developed at MIT (US) in 1968
- single-pass algorithm, context sensitive, iterative longest-match, 294 ending patterns, uses a recoding phase to deal with spelling ambiguity

27 Building the vocabulary: About existing stemmers (2 / 2)
Paice/Husk stemmer:
    stemming/links/paice.htm
- developed at Lancaster University (UK) in 1990
- single table of rules, specifying the removal or replacement of an ending
- indefinite number of stages, rules are indexed by the last letter of the ending, aggressive suffix removal

28 Building the vocabulary: About stemming
- Dangers of stemming: "information retrieval" vs. "information on Golden Retrievers", "gravity" vs "gravitation" (examples from Amit Singhal and Richard Belew, respectively)
- usefulness of stemming? compressed index, vocabulary reduced by 10 to 50 % (fast processing) vs inconvenience due to poor retrieval (loss of meaning brought by the context)
- benefits of stemming depend on the language (it suits languages with rich word inflection better)

29 Building an index: Part 3. Building an index

30 Building an index
1. Block-merge indexing
2. Single-pass indexing
3. Distributed indexing
4. Dynamic indexing

31 Building an index: Efficiency considerations (1 / 2)
- Indexing algorithms depend on hardware characteristics
- Accessing data in memory is faster than on disk
- Moving the disk head to a non-contiguous area is time-consuming
- Operating systems read and write blocks of bytes. Reading a single byte takes the same time as reading the whole block (common block size: 2^n KB)
- Some figures (2007 standards):
    a disk seek takes s
    a block transfer from disk takes 10^-7 s per byte
    a processor operation takes 10^-7 s
    a processor's clock cycle is 10^-9 s (avg. 1 GHz)

32 Building an index: Efficiency considerations (2 / 2)
To give an idea: example of the REUTERS collection (Aug. 96 - Aug. 97):
    statistic                                          value
    documents                                          800,000
    avg. # word tokens per document                    200
    word types                                         400,000
    avg. # bytes per token (incl. spaces/punct.)       6
    avg. # bytes per token (without spaces/punct.)     4.5
    avg. # bytes per word type                         7.5
    non-positional postings                            100,000,000
Each posting entry is encoded using 12 bytes (4+4+4: term, doc, freq)

33 Building an index: Recall - indexing
- An index can be built by assembling all postings (term, docid) via a first pass through the collection
- Then postings are sorted by term and docid
- Finally, postings belonging to a given term are compacted into a posting list, and statistics about terms are computed
- NB: for efficiency reasons, a term can be represented by a termid
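
The three steps recalled here (collect, sort, compact) fit in a few lines when everything is held in memory. This is a sketch under that assumption; it uses the terms themselves instead of termids, whitespace tokenization, and an invented function name.

```python
from itertools import groupby


def build_index(docs):
    """Sort-based index: term -> (collection frequency, sorted doc ids)."""
    # Step 1: one pass over the collection to collect (term, docid) pairs.
    postings = [(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.lower().split()]
    # Step 2: sort by term, then by docid.
    postings.sort()
    # Step 3: compact the pairs of each term into one posting list.
    index = {}
    for term, group in groupby(postings, key=lambda pair: pair[0]):
        doc_ids = []
        frequency = 0
        for _, doc_id in group:
            frequency += 1
            if not doc_ids or doc_ids[-1] != doc_id:
                doc_ids.append(doc_id)
        index[term] = (frequency, doc_ids)
    return index


docs = {1: "to be or not to be", 2: "to do"}
index = build_index(docs)
print(index["to"])  # (3, [1, 2])
```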

34 Building an index: Block-merge indexing (1 / 3)
- The list of posting entries may be huge and not fit into memory
- To avoid this, one uses an external sorting algorithm
- Note that disk seeks have to be minimized
- Principles of the block merge algorithm:
    (a) split the collection into parts of equal size,
    (b) sort the postings (termid, docid) of each part in memory,
    (c) store the intermediate results on disk, and
    (d) merge all intermediate results to produce the final index

35 Building an index: Block-merge indexing (2 / 3)
Block merge algorithm (from [Manning et al, 07]):
    blockmerge(collection c)
      n <- 0
      do
        n <- n + 1
        block <- parsenextblock(c)
        invert(block)
        writetodisc(block, f_n)
      while (c != [])
      return merge([f_1 .. f_n])
NB: merging needs to know the term-termid mapping.
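
The algorithm can be sketched in Python as follows. To stay short and self-contained, the sorted blocks are kept in a list instead of being written to disk, terms are used directly instead of termids, and tokenization is a plain whitespace split; all of these are simplifications, and the function names are invented.

```python
import heapq
from itertools import groupby


def parse_blocks(docs, block_size):
    """Yield lists of (term, docid) pairs holding at most block_size entries."""
    block = []
    for doc_id, text in docs:
        for term in text.lower().split():
            block.append((term, doc_id))
            if len(block) == block_size:
                yield block
                block = []
    if block:
        yield block


def block_merge(docs, block_size):
    """Invert each block separately, then merge the sorted runs."""
    # Each sorted run stands in for one intermediate file f_1 .. f_n.
    runs = [sorted(set(block)) for block in parse_blocks(docs, block_size)]
    index = {}
    # heapq.merge reads the runs sequentially, like a multi-way file merge.
    merged = heapq.merge(*runs)
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        doc_ids = []
        for _, doc_id in group:
            if not doc_ids or doc_ids[-1] != doc_id:
                doc_ids.append(doc_id)
        index[term] = doc_ids
    return index


docs = [(1, "to be or not to be"), (2, "to do or not"), (3, "be good")]
index = block_merge(docs, block_size=4)
print(index["to"], index["be"])  # [1, 2] [1, 3]
```

Only one block is sorted at a time, so memory use is bounded by `block_size`; the merge step then needs sequential reads only, which keeps disk seeks low when the runs are real files.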

36 Building an index: Block-merge indexing (3 / 3)
Disk transfer time during merge with the REUTERS sample (100,000,000 postings):
    64 blocks × 1,600,000 entries × 12 bytes × b × 2 transfers ≈ 246 s ≈ 4.1 minutes
where b = 10^-7 seconds/byte is the transfer time per byte.
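
The arithmetic behind this estimate can be replayed as below. The sketch takes b = 10^-7 s per byte and 12 bytes per posting from the slides; the block size of 1,600,000 entries is an assumption chosen because it reproduces the quoted 4.1 minutes, and the function name is made up.

```python
def merge_transfer_seconds(n_postings, bytes_per_posting=12, b=1e-7, transfers=2):
    """Time to move all postings across the disk interface `transfers` times."""
    return n_postings * bytes_per_posting * b * transfers


# Exactly the 100,000,000 postings of the REUTERS statistics.
reuters = merge_transfer_seconds(100_000_000)
# 64 blocks of 1,600,000 entries each (assumed block size).
blocks = merge_transfer_seconds(64 * 1_600_000)
print(f"{reuters / 60:.1f} min, {blocks / 60:.1f} min")  # 4.0 min, 4.1 min
```

The estimate covers transfer time only; disk seeks and the CPU cost of sorting and merging come on top of it.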

37 Building an index: Single-pass indexing (1 / 2)
- Limit of the block merge method: the term-termid mapping may not fit into memory
- Idea: create a dictionary for each block, and store this dictionary on disk (NB: the dictionary contains the terms themselves)
- A final step merges the dictionaries and postings
- Principles of the single-pass indexing algorithm:
    (a) split the collection,
    (b) create a dictionary for the current split,
    (c) for each token, if its term belongs to the current dictionary, retrieve its posting list; if not, create a new empty posting list for it and add a new entry to the dictionary,
    (d) add the current token to the posting list,
    (e) sort the terms, and
    (f) write the block and dictionary to disk

38 Building an index: Single-pass indexing (2 / 2)
Single-pass indexing algorithm (from [Manning et al, 07]):
    invert(stream c)
      output <- new File()
      dictionary <- new Hash()
      while (size < sizemax) do
        tok <- next(c)
        if not (term(tok) in dictionary)
          then p_list <- addtodict(dictionary, term(tok))
          else p_list <- getfromdict(dictionary, term(tok))
        if (full(p_list))
          then p_list <- doublesize(p_list)
        addtolist(p_list, tok)
      endwhile
      sorted <- sortterms(dictionary)
      writetodisc(sorted, dictionary, output)
      return output
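
A Python sketch of the same inversion step, under a few assumptions: tokens arrive as (term, docid) pairs, the memory limit is expressed as a number of postings, and the sorted block is returned instead of being written to a file. Python lists grow on their own, so the explicit doubling of a full posting list disappears. The function name is invented.

```python
def spimi_invert(token_stream, max_postings):
    """Build one block: a term-sorted list of (term, posting list) pairs.

    Consumes tokens from the iterator until it is exhausted or until the
    block holds max_postings postings (a stand-in for "memory is full").
    """
    dictionary = {}
    count = 0
    for term, doc_id in token_stream:
        # Terms are added to the dictionary on first sight, no termid needed.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings:
            break
    # Sorting the terms makes the final merge of blocks a linear scan.
    return sorted(dictionary.items())


stream = iter([("to", 1), ("be", 1), ("or", 1), ("to", 2), ("do", 2)])
block1 = spimi_invert(stream, max_postings=3)
block2 = spimi_invert(stream, max_postings=3)
print(block1)  # [('be', [1]), ('or', [1]), ('to', [1])]
print(block2)  # [('do', [2]), ('to', [2])]
```

Calling the function repeatedly on the same iterator yields one block per call; a final step would merge the blocks term by term.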

39 Building an index: Distributed indexing (1 / 4)
- Huge collections: the index cannot be computed on a single machine, it is then partitioned across several machines
- We can choose to partition either the postings or the keywords (document-partition versus term-partition)
- In all cases, the term-termid mapping must be consistent (e.g. precomputed)
- General idea: using a cluster of machines, each representing a node that performs a sub-task
- A node may crash; in that case the task it has been given is reallocated to another node
- A master node manages these task allocations (e.g. robustness, synchronization)

40 Building an index: Distributed indexing (2 / 4)
- General architecture for indexing: MapReduce
- The collection is split (the size of each fragment is chosen for efficiency: trade-off between flexibility and read/write access time)
- The Map phase applies a parser to each split, which computes the postings contained in the split
- The output of a parser is stored in local intermediate files called segment files
- Each segment file contains the postings for some terms (cf. term-partition)
- The Reduce phase assigns an inverter to each term-partition; this inverter collects the corresponding postings from the segment files
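
The data flow of the two phases can be simulated on a single machine as sketched below. Nothing here is distributed: parsers and inverters are plain function calls, segment files are in-memory lists, and the term partitions are defined by alphabetical boundaries. All names and the sample splits are assumptions made for the illustration.

```python
from collections import defaultdict


def partition_of(term, boundaries):
    """Index of the term partition, e.g. ["h", "q"] gives a-g, h-p, q-z."""
    return sum(term >= b for b in boundaries)


def map_phase(split, boundaries):
    """Parser: turn one split into segments, one per term partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term, boundaries)].append((term, doc_id))
    return segments


def reduce_phase(partition, all_segments):
    """Inverter: collect the postings of one term partition from all parsers."""
    postings = defaultdict(set)
    for segments in all_segments:
        for term, doc_id in segments.get(partition, []):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


splits = [[(1, "caesar came")], [(2, "caesar conquered rome")]]
boundaries = ["h", "q"]
all_segments = [map_phase(split, boundaries) for split in splits]
index = [reduce_phase(p, all_segments) for p in range(len(boundaries) + 1)]
print(index[0])  # {'caesar': [1, 2], 'came': [1], 'conquered': [2]}
print(index[2])  # {'rome': [2]}
```

In a real cluster each `map_phase` and `reduce_phase` call would be a task handed out by the master node, and the segments would be files read over the network by the inverters.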

41 Building an index: Distributed indexing (3 / 4)
J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters (2004)
    mapreduce-osdi04.pdf

42 Building an index: Distributed indexing (4 / 4)
Remarks:
- The number of partitions is a parameter of the indexing system
- The posting list associated with a single term is supposed to fit in a single machine's memory
- Parsers and inverters are not separate machines: depending on availability, the master node assigns a machine either a parsing or an inverting task
- For efficiency reasons, the network traffic is reduced as much as possible

43 Building an index: To be continued...
