Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Ch. 3 Outline Scalable index construction: BSBI, SPIMI. Distributed indexing: MapReduce. Dynamic indexing. 2

Ch. 4 Index construction How do we construct an index? What strategies can we use with limited main memory? 3

Sec. 4.1 Hardware basics Many design decisions in information retrieval are based on the characteristics of hardware. We begin by reviewing hardware basics. 4

Sec. 4.1 Hardware basics Access to memory is much faster than access to disk. Disk seeks: no data is transferred from disk while the disk head is being positioned. Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks. Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB. 5

Sec. 4.1 Hardware basics Servers used in IR systems now typically have tens of GB of main memory. Available disk space is several (2-3) orders of magnitude larger. Size of main memory: several GB. Size of disk space: 1 TB or more. (2007 hardware) 6

Sec. 4.1 Hardware assumptions for this lecture (2007 hardware) 7
average seek time: 5 ms = 5 x 10^-3 s
transfer time per byte: 0.02 μs = 2 x 10^-8 s
processor's clock rate: 10^9 per s
low-level operation (e.g., compare & swap a word): 0.01 μs = 10^-8 s
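
With these numbers, a quick back-of-the-envelope comparison shows why large block transfers matter. Below is a minimal sketch in Python; the 10 MB read size is an illustrative assumption, not a figure from the slides.

seek = 5e-3                           # average seek time: 5 ms
per_byte = 2e-8                       # transfer time per byte: 0.02 microseconds
size = 10_000_000                     # read 10 MB of data (illustrative size)
print(seek + size * per_byte)         # ~0.2 s: one seek plus one contiguous transfer
print(1000 * seek + size * per_byte)  # ~5.2 s: same bytes read as 1000 scattered chunks

Reading the same data as many small scattered chunks is dominated by seek time, here roughly 25 times slower than one contiguous read.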

Sec. 4.2 RCV1: Our collection for this lecture As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection. It isn't really large enough, but it is a plausible example: one year of Reuters newswire (parts of 1996 and 1997). 8

Sec. 4.2 A Reuters RCV1 document 9

Sec. 4.2 Reuters RCV1 statistics (approximate): 800K documents, 100M tokens, 400K terms (types). 4.5 bytes per word token vs. 7.5 bytes per word type: why? Tokens are dominated by frequent short words, while each distinct type, including long rare words, is counted only once, so the average type is longer. 10

Sec. 4.2 Recall: index construction Docs are parsed to extract words and these are saved with the doc ID. Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious." This yields the (term, doc #) pairs in order of occurrence: (I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (I, 1), (was, 1), (killed, 1), (i', 1), (the, 1), (capitol, 1), (brutus, 1), (killed, 1), (me, 1), (so, 2), (let, 2), (it, 2), (be, 2), (with, 2), (caesar, 2), (the, 2), (noble, 2), (brutus, 2), (hath, 2), (told, 2), (you, 2), (caesar, 2), (was, 2), (ambitious, 2). 11

Sec. 4.2 Recall: index construction (key step) After all docs have been parsed, the inverted file is sorted by terms (and by doc # within a term). We focus on this sort step; we have 100M items to sort. For the example docs, the sorted (term, doc #) pairs are: (ambitious, 2), (be, 2), (brutus, 1), (brutus, 2), (caesar, 1), (caesar, 2), (caesar, 2), (capitol, 1), (did, 1), (enact, 1), (hath, 2), (I, 1), (I, 1), (i', 1), (it, 2), (julius, 1), (killed, 1), (killed, 1), (let, 2), (me, 1), (noble, 2), (so, 2), (the, 1), (the, 2), (told, 2), (was, 1), (was, 2), (with, 2), (you, 2). 12
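
A tiny in-memory illustration of this key step in Python (a sketch only; the pairs list below is a hypothetical subset of the example above):

from itertools import groupby

# (term, doc #) pairs in the order they were emitted during parsing
pairs = [("I", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("caesar", 2)]

# the key step: sort by term, then by doc #
pairs.sort(key=lambda p: (p[0].lower(), p[1]))

# group the sorted pairs into postings lists
index = {term: [d for _, d in group]
         for term, group in groupby(pairs, key=lambda p: p[0].lower())}
print(index["caesar"])   # [1, 2]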

Sec. 1.2 Recall: Inverted index The dictionary maps each term to its postings list of doc IDs, sorted by doc ID: Brutus → 1, 2, 4, 11, 31, 45, 173; Caesar → 1, 2, 4, 5, 6, 16, 57, 132; Calpurnia → 2, 31, 54, 101. 13

Sec. 4.2 Scaling index construction In-memory index construction does not scale: we can't stuff the entire collection into memory, sort it, and then write it back. For indexing very large collections, taking into account the hardware constraints we just learned about, we need to store intermediate results on disk. 14

Sec. 4.2 Sort using disk as memory? Can we use the same index construction algorithm for larger collections, but using disk instead of memory? No: sorting T = 100,000,000 records on disk with random disk seeks is far too slow; if every comparison needs two disk seeks, we need O(T log T) disk seeks. We need an external sorting algorithm. 15
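
The back-of-the-envelope arithmetic behind the "too slow" claim, assuming a comparison-based sort needs about T log2 T comparisons and, as stated above, two seeks per comparison (a sketch in Python):

import math

T = 100_000_000                          # records to sort
comparisons = T * math.log2(T)           # ~2.7e9 comparisons
seeks = 2 * comparisons                  # two disk seeks per comparison
print(seeks * 5e-3 / 86_400)             # ~300 days at 5 ms per seek

Close to a year spent on seeks alone, which is why an external sorting algorithm that avoids random seeks is needed.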

BSBI: Blocked Sort-Based Indexing (Sorting with fewer disk seeks) Basic idea of the algorithm: segment the collection into blocks (parts of nearly equal size); accumulate postings for each block, sort them, and write the sorted run to disk; then merge the runs into one long sorted order. 16
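
A minimal sketch of the per-block step in Python (illustrative only, not the course's reference implementation; read_blocks and the run-file naming are hypothetical):

import pickle

def invert_block(pairs, run_path):
    """One BSBI block: sort (termID, docID) pairs and write a sorted run to disk."""
    pairs.sort()                                   # sort by termID, then by docID
    postings = {}
    for term_id, doc_id in pairs:
        postings.setdefault(term_id, []).append(doc_id)
    with open(run_path, "wb") as f:
        pickle.dump(sorted(postings.items()), f)   # run is sorted by termID

# usage sketch: one run file per block of ~10M records
# for i, block in enumerate(read_blocks(collection)):
#     invert_block(block, f"run_{i}.bin")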

BSBI: Reuters dataset Must now sort T = 100M such records by term. Define a block as ~10M such records; we can easily fit a couple of blocks into memory. We will have 10 such blocks to start with. 18

Sec. 4.2 Sorting 10 blocks of 10M records 19

Sec. 4.2 BSBI: terms to termIDs It is wasteful to use (term, docID) pairs, since the term string must be saved for each pair individually. Instead, BSBI uses (termID, docID) pairs and thus needs a data structure for mapping terms to termIDs. This data structure must be in main memory. (termID, docID) pairs are generated as we parse docs: 4 + 4 = 8 bytes per record. 20
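
A minimal way to realize such a mapping (a sketch; the dictionary below lives entirely in main memory, which is exactly the limitation discussed on a later slide):

term_to_id = {}                      # in-memory dictionary: term -> termID

def get_term_id(term):
    """Assign consecutive integer termIDs the first time a term is seen."""
    if term not in term_to_id:
        term_to_id[term] = len(term_to_id)
    return term_to_id[term]

record = (get_term_id("caesar"), 2)  # emit (termID, docID): 4 + 4 = 8 bytes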

Sec. 4.2 How to merge the sorted runs? We can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers; during each layer, read runs into memory in blocks of 10M, merge, and write back. But it is more efficient to do a multi-way merge, reading from all runs simultaneously, provided we read decent-sized chunks of each run into memory and write out decent-sized output chunks. Then we're not killed by disk seeks. 22
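
The multi-way merge can be sketched with Python's heapq.merge (an illustration, assuming each run file holds one "termID docID" posting per line, already sorted; keys would need zero-padding so that string order matches numeric order):

import heapq

def merge_runs(run_paths, out_path):
    """Multi-way merge: read from all sorted runs at once, write one sorted file."""
    files = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge streams the runs; buffered file reads supply the
            # decent-sized chunks pulled from each run
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()

# merge_runs(["run_0.txt", "run_1.txt"], "merged.txt")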

Sec. 4.3 Remaining problem with the sort-based algorithm Our assumption was that the dictionary can be kept in memory: we need the dictionary (which grows dynamically) in order to implement the term-to-termID mapping. Actually, we could work with (term, docID) postings instead of (termID, docID) postings, but then the intermediate files become very large. Using the terms themselves in this method would give a scalable, but very slow, index construction method. 23

Sec. 4.3 SPIMI: Single-Pass In-Memory Indexing Key idea 1: generate a separate dictionary for each block; a term is saved only once per block, for the whole of its postings list (not once for each docID containing it). Key idea 2: accumulate postings in postings lists as they occur, so each list is built already sorted by docID. With these two ideas we can generate a complete inverted index for each block. These separate indexes can then be merged into one big index; merging of blocks is analogous to BSBI. There is no need to maintain a term-termID mapping across blocks. 24

Sec. 4.3 SPIMI-Invert [pseudocode figure] SPIMI runs in O(T) time. Sort the terms before writing a block to disk: writing postings lists in lexicographic order of their terms facilitates the final merging step. 25
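
A minimal single-block sketch of SPIMI-Invert in Python (an illustration of the idea; it assumes a token_stream of (term, docID) pairs in docID order and omits the "memory full" test that ends a block):

import pickle

def spimi_invert(token_stream, block_path):
    """Build a complete in-memory index for one block, then write it to disk."""
    dictionary = {}                        # term -> postings list (no termIDs)
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)        # lists grow in docID order as postings occur
    with open(block_path, "wb") as f:
        # sort the terms so per-block indexes can be merged with a sequential scan
        pickle.dump(sorted(dictionary.items()), f)

# spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)], "block_0.bin")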

Sec. 4.3 SPIMI: Compression Scalable: SPIMI can index collections of any size (given enough disk space). It is more efficient than BSBI since it does not allocate memory for maintaining a term-termID mapping, and during index construction it does not need to store a separate termID for each posting (as opposed to BSBI). Some memory is wasted in the postings lists (variable-size array structures), which counteracts part of the memory savings from the omission of termIDs. 26

Sec. 4.4 Distributed indexing For web-scale indexing we must use a distributed computing cluster. Individual machines are fault-prone: they can unpredictably slow down or fail. Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine. How do we exploit such a pool of machines? 27

Google data centers (2007 estimates; Gartner) Google data centers mainly contain commodity machines. Data centers are distributed all over the world. Estimates: 1 million servers, 3 million processors/cores; Google installs 100,000 servers each quarter. 28

Sec. 4.4 Massive data centers If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system? Answer: 36% Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers. 29
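
The 36% figure follows if node failures are assumed independent, since the whole system is up only when all 1000 nodes are up at the same time. A one-line check in Python:

print(0.999 ** 1000)   # ~0.368: probability that all 1000 nodes are up at once

Equivalently, at any given moment there is about a 63% chance that at least one node is down.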

Sec. 4.4 Distributed indexing Maintain a master machine directing the indexing job; the master is considered "safe". To provide fault tolerance in a massive data center: break up indexing into sets of (parallel) tasks, and let the master machine assign each task to an idle machine from a pool. 30

Sec. 4.4 Parallel tasks We will use two sets of parallel tasks: parsers and inverters. Break the input document collection into splits; each split is a subset of docs (corresponding to blocks in BSBI/SPIMI). 31

Sec. 4.4 Data flow The master assigns splits to parser machines and term partitions to inverter machines. Map phase: each parser reads its split and writes (term, doc) pairs into segment files, one per term partition (a-f, g-p, q-z). Reduce phase: the inverter for each term partition (a-f, g-p, or q-z) collects the matching segment files from all parsers and produces the postings for that partition. 32

Sec. 4.4 Parsers The master assigns a split to an idle parser machine. The parser reads one doc at a time, emits (term, doc) pairs, and writes the pairs into j partitions; each partition covers a range of terms. Example: j = 3 partitions by the terms' first letters: a-f, g-p, q-z. 33

Sec. 4.4 Inverters An inverter collects all (term, doc) pairs for one term partition, sorts them, and writes the resulting postings lists. 34

Sec. 4.4 MapReduce The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004): a robust and conceptually simple framework for distributed computing without having to write code for the distribution part. Google indexing system (ca. 2002): a number of phases, each implemented in MapReduce. 35

Sec. 4.4 Schema for index construction in MapReduce Schema of the map and reduce functions: map: (k, v) → list(k', v'); reduce: (k', list(v')) → list(output). Example with d1: "C came, C c'ed." and d2: "C died." (C = Caesar): map emits <C,d1>, <came,d1>, <C,d1>, <c'ed,d1>, <C,d2>, <died,d2>; grouping by key gives (<C,(d1,d1,d2)>, <died,(d2)>, <came,(d1)>, <c'ed,(d1)>); reduce then produces (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c'ed,(d1:1)>). 36
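
A toy, single-machine simulation of this schema in Python (a sketch only; real MapReduce distributes the map calls, the group-by-key shuffle, and the reduce calls across machines):

from collections import Counter, defaultdict

def map_fn(doc_id, text):
    """Map: emit one (term, docID) pair per token."""
    return [(tok.lower().strip(".,;"), doc_id) for tok in text.split()]

def reduce_fn(term, doc_ids):
    """Reduce: turn the docID list for one term into (docID, term frequency) postings."""
    return term, sorted(Counter(doc_ids).items())

docs = {"d1": "C came, C c'ed.", "d2": "C died."}
grouped = defaultdict(list)                       # the shuffle / group-by-key step
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)
index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
print(index["c"])                                 # [('d1', 2), ('d2', 1)]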

Term-partitioned vs. doc-partitioned [comparison table] The two partitioning schemes are compared along load balancing, scalability, disk seeks, and dynamic indexing. 37

Sec. 4.4 MapReduce for index construction Another phase: transforming a term-partitioned index into a document-partitioned index. Term-partitioned: one machine handles a subrange of terms. Document-partitioned: one machine handles a subrange of docs. Most search engines use a document-partitioned index (better load balancing, etc.). 38

Sec. 4.5 Dynamic indexing Up to now, we have assumed that collections are static. They rarely are: docs come in over time and need to be inserted; docs are deleted and modified. This means that the dictionary and postings lists have to be modified: postings updates for terms already in the dictionary, and new terms added to the dictionary. 39

Sec. 4.5 Simplest approach Maintain a big main index; new docs go into a small auxiliary index. Search across both and merge the results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs in search results through it. Periodically, re-index everything into one main index. 40
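
A toy sketch of this scheme (illustrative names only; the main and auxiliary indexes map terms to docID lists, and deleted plays the role of the invalidation bit-vector, here simply a set of docIDs):

main_index = {"caesar": [1, 2, 5]}    # big index, rebuilt only periodically
aux_index = {}                        # small in-memory index for new docs
deleted = set()                       # invalidation "bit-vector" for deleted docs

def add_doc(doc_id, terms):
    for t in terms:
        aux_index.setdefault(t, []).append(doc_id)

def delete_doc(doc_id):
    deleted.add(doc_id)               # mark only; postings are purged at re-index time

def search(term):
    hits = main_index.get(term, []) + aux_index.get(term, [])
    return [d for d in hits if d not in deleted]   # filter out deleted docs

add_doc(7, ["caesar", "brutus"])
delete_doc(2)
print(search("caesar"))               # [1, 5, 7]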

Sec. 4.5 Issues with main and auxiliary indexes Frequent merges are a problem: you touch the same data a lot, and performance is poor during a merge. Actually, merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list; the merge is then just a simple append. But then we would need a very large number of files, which is inefficient for the OS. 41

Sec. 4.5 Dynamic indexing at search engines All the large search engines now do dynamic indexing. Their indices have frequent incremental changes: news, blogs, new topical web pages. But (sometimes/typically) they also periodically reconstruct the index from scratch; query processing is then switched to the new index, and the old index is deleted. 42

Ch. 4 Resources Chapter 4 of IIR. MG (Managing Gigabytes), Chapter 5. Original publication on MapReduce: Dean and Ghemawat (2004). Original publication on SPIMI: Heinz and Zobel (2003). 43