Index Construction
Information Retrieval
By Dr. Qaiser Abbas
Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan
qaiser.abbas@uos.edu.pk
4.1 Index construction
How do we construct an index? What strategies can we use with limited main memory?
Hardware Basics
Many design decisions in information retrieval are based on the characteristics of hardware. We begin by reviewing hardware basics.
Hardware basics
Access to data in memory is much faster than access to data on disk.
Disk seeks: no data is transferred from disk while the disk head is being positioned. Therefore, transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: reading and writing happen in entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
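As a rough illustration of exploiting block-based I/O, a minimal Python sketch that reads a file in large fixed-size chunks rather than byte by byte (the file path and block size are arbitrary choices for the example):

BLOCK_SIZE = 64 * 1024  # 64 KB; illustrative, within the 8 KB to 256 KB range above

def read_in_blocks(path):
    # Each f.read(BLOCK_SIZE) is one large sequential transfer,
    # rather than many small ones interleaved with seeks.
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield block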
Hardware basics
Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. Available disk space is several (two to three) orders of magnitude larger.
Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine.
Hardware basics
Table 4.1 System parameters assumed in this chapter (the seek, transfer, and operation times below are the ones used in the exercises):
  symbol  statistic                                        value
  s       average seek time                                5 ms = 5 × 10^-3 s
  b       transfer time per byte                           0.02 μs = 2 × 10^-8 s
          processor's clock rate                           10^9 per second
  p       low-level operation (e.g., compare/swap a word)  0.01 μs = 10^-8 s
          size of main memory                              several GB
          size of disk space                               1 TB or more
4.2 Recall: Inverted Index
Earlier approach
Pass through the collection and assemble all term-docID pairs. Sort the pairs with the term as the dominant key and docID as the secondary key. Finally, organize the docIDs for each term into a postings list and compute statistics like term and document frequency.
For small collections, all this can be done in memory. However, we will describe methods for large collections that require the use of secondary storage.
To make index construction more efficient, we represent terms as termIDs (instead of strings, as we did in Figure 1.4), each a unique serial number.
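A minimal Python sketch of this in-memory pipeline (the input format and function name are our own choices for illustration, not from the textbook):

from collections import defaultdict

def build_in_memory_index(docs):
    # docs: dict mapping docID -> list of tokens (assumed input format).
    term_to_id = {}                   # the dictionary: term -> termID
    pairs = []
    for doc_id, tokens in sorted(docs.items()):
        for token in tokens:
            term_id = term_to_id.setdefault(token, len(term_to_id))
            pairs.append((term_id, doc_id))
    pairs.sort()                      # termID as dominant key, docID secondary
    postings = defaultdict(list)
    for term_id, doc_id in pairs:
        if not postings[term_id] or postings[term_id][-1] != doc_id:
            postings[term_id].append(doc_id)
    # document frequency of a term = len(postings[termID])
    return term_to_id, postings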
Reuters-RCV1 collection
As an example for applying index construction algorithms, we will use the Reuters-RCV1 collection (approx. 1 GB): one year of Reuters newswire (part of 1996 and 1997).
The corpus isn't really large enough to be challenging, but it is publicly available and is at least a plausible example.
A Reuters-RCV1 document
[Figure: a sample Reuters-RCV1 newswire document]
Reuters-RCV1 statistics
  documents                                             800,000
  avg. # tokens per document                            200
  terms (word types)                                    400,000
  avg. # bytes per token (incl. spaces/punct.)          6
  avg. # bytes per token (without spaces/punct.)        4.5
  avg. # bytes per term                                 7.5
  T (non-positional postings, i.e. termID-docID pairs)  100,000,000
Issue in Indexing
Reuters-RCV1 has 100 million tokens. Collecting all termID-docID pairs of the collection, using 4 bytes each for termID and docID (8 bytes per pair), therefore requires 10^8 × 8 bytes = 0.8 GB of storage.
Typical collections today are often one or two orders of magnitude larger than Reuters-RCV1. You can easily see how such collections overwhelm even large computers if we try to sort their termID-docID pairs in memory.
If the size of the intermediate files during index construction is within a small factor of available memory, then the compression techniques introduced in Chapter 5 can help; however, the postings file of many large collections cannot fit into memory even after compression.
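The back-of-the-envelope arithmetic, as a quick check (the function name is ours):

def pairs_storage_gb(n_pairs, bytes_per_id=4):
    # Each pair stores one termID and one docID.
    return n_pairs * 2 * bytes_per_id / 10**9

print(pairs_storage_gb(10**8))    # Reuters-RCV1: 0.8 GB
print(pairs_storage_gb(10**10))   # a collection two orders of magnitude larger: 80 GB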
Issue in Indexing
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting: sequential disk reads are far faster than seeks, as we explained in Section 4.1.
One solution is the blocked sort-based indexing algorithm, or BSBI, in Figure 4.2.
BSBI Algorithm
[Figure 4.2: the blocked sort-based indexing algorithm]
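Since Figure 4.2 itself is not reproduced on this slide, the following Python sketch mirrors its structure. The helper names (all_documents_processed, parse_next_block, write_block_to_disk) are our stand-ins for the pseudocode's steps, not real library calls; bsb_invert and merge_blocks are sketched on the next slides.

def bsb_index_construction():
    n = 0
    block_files = []
    while not all_documents_processed():        # assumed helper
        n += 1
        block = parse_next_block()              # accumulate termID-docID pairs until the block is full
        inverted_block = bsb_invert(block)      # sort and group into postings lists
        block_files.append(write_block_to_disk(inverted_block, n))
    merge_blocks(block_files, "merged_index")   # final k-way merge of all blocks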
BSBI Algorithm
The algorithm parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in Figure 4.2). We choose the block size to fit comfortably into memory, to permit a fast in-memory sort. The block is then inverted and written to disk.
Inversion involves two steps. First, we sort the termID-docID pairs. Next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk.
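A minimal sketch of this inversion step, assuming pairs is an in-memory list of (termID, docID) tuples:

from itertools import groupby

def bsb_invert(pairs):
    pairs.sort()                                 # termID dominant, docID secondary
    inverted = []
    for term_id, group in groupby(pairs, key=lambda p: p[0]):
        inverted.append((term_id, [doc_id for _, doc_id in group]))
    return inverted                              # one postings list per termID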
BSBI Algorithm
Applying this to Reuters-RCV1 and assuming we can fit 10 million termID-docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.
In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 4.3. To do the merging, we open all block files simultaneously, and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged index we are writing.
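A sketch of the merge step, assuming each block file stores its sorted (termID, docID) pairs one per line (the file format is illustrative); heapq.merge performs the k-way merge, with the per-file generators acting as the small read buffers:

import heapq

def read_pairs(path):
    with open(path) as f:                        # one small read buffer per block
        for line in f:
            term_id, doc_id = line.split()
            yield int(term_id), int(doc_id)

def merge_blocks(paths, out_path):
    streams = [read_pairs(p) for p in paths]
    with open(out_path, "w") as out:             # write buffer for the merged index
        for term_id, doc_id in heapq.merge(*streams):
            out.write(f"{term_id} {doc_id}\n")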
BSBI Algorithm
[Figure 4.3: merging of two blocks into a merged index]
BSBI Algorithm Complexity
How expensive is BSBI? Its time complexity is Θ(T log T), because the step with the highest time complexity is sorting, and T is an upper bound on the number of items (i.e., the number of termID-docID pairs).
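Plugging in T = 10^8 for Reuters-RCV1 gives a feel for the numbers, using the 10^-8 s per low-level operation from Table 4.1:

import math

T = 100_000_000                          # termID-docID pairs in RCV1
comparisons = T * math.log2(T)           # about 2.7e9 comparisons
in_memory_s = comparisons * 1e-8         # ~27 s at 10^-8 s per operation, in memory
print(in_memory_s)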
Class Exercise
Exercise 4.1: If we need T log₂ T comparisons (where T is the number of termID-docID pairs) and two disk seeks for each comparison, how much time would index construction for Reuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the system parameters in Table 4.1.
Solution
Disk seek time = 5 × 10^-3 s, so the two seeks per comparison cost 2 × (5 × 10^-3) s.
Transfer time = 2 × 10^-8 s per byte; a low-level operation takes 10^-8 s.
How long would it take to make T log₂ T comparisons with two disk seeks per comparison? T log₂ T × 2 × (5 × 10^-3) s, plus transfer time and low-level operations, which are negligible next to the seeks.
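Carrying the arithmetic through (a quick check; the answer is dominated entirely by seek time):

import math

T = 100_000_000                     # termID-docID pairs
seek = 5e-3                         # average disk seek time in s (Table 4.1)
comparisons = T * math.log2(T)      # ~2.66e9 comparisons
total_s = comparisons * 2 * seek    # two seeks per comparison
print(total_s / 86_400, "days")     # ~307 days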
Class Exercise
Exercise 4.2: How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?
Solution: Skip the initial step of sorting the raw termIDs and docIDs; instead, create a postings list on the fly whenever you encounter a new termID, and append a new posting to that list for each new incidence of the termID. This avoids an extra pass through the data, and because each block is still sorted and written out before the final merge, it is still blocked sort-based indexing. A sketch follows below.
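One way to realize this idea, sketched in Python (the input format is assumed: a stream of (term, docID) pairs for one block, in docID order, so each postings list comes out sorted):

from collections import defaultdict

def invert_block_on_the_fly(token_stream):
    term_to_id = {}                             # dictionary built on the fly
    postings = defaultdict(list)
    for term, doc_id in token_stream:
        term_id = term_to_id.setdefault(term, len(term_to_id))
        plist = postings[term_id]
        if not plist or plist[-1] != doc_id:    # new incidence of this term
            plist.append(doc_id)
    # One sort per block by termID keeps the blocks mergeable, so the
    # overall scheme remains blocked sort-based indexing.
    return term_to_id, sorted(postings.items())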