Index construction. Friday, 8 April 16


Index construction
Information Retrieval
By Dr. Qaiser Abbas, Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan. qaiser.abbas@uos.edu.pk
Friday, 8 April 16

4.1 Index construction
How do we construct an index? What strategies can we use with limited main memory?
Hardware basics: Many design decisions in information retrieval are based on the characteristics of hardware. We begin by reviewing hardware basics.

Hardware basics
Access to data in memory is much faster than access to data on disk.
Disk seeks: No data is transferred from disk while the disk head is being positioned. Therefore, transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: entire blocks are read and written (as opposed to smaller chunks). Block sizes range from 8 KB to 256 KB.
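The effect of seeks on transfer time can be illustrated with a rough cost model. This is only a sketch: the seek time and per-byte transfer time are the values quoted in the exercise solution later in these slides, while the 10 MB payload, the 100-chunk split, and the function name `read_time` are hypothetical choices made for illustration.

```python
# Assumed parameters (values quoted in the exercise solution in these slides).
SEEK_TIME_S = 5e-3           # one disk seek, in seconds
TRANSFER_S_PER_BYTE = 2e-8   # sequential transfer, in seconds per byte

def read_time(total_bytes: int, num_chunks: int) -> float:
    """Estimate the time to read total_bytes stored as num_chunks separate
    chunks, paying one seek per chunk plus the sequential transfer time."""
    return num_chunks * SEEK_TIME_S + total_bytes * TRANSFER_S_PER_BYTE

ten_mb = 10 * 10**6                      # hypothetical payload size
one_chunk = read_time(ten_mb, 1)         # one contiguous read
hundred_chunks = read_time(ten_mb, 100)  # same data scattered over 100 places

print(f"1 chunk: {one_chunk:.3f} s, 100 chunks: {hundred_chunks:.3f} s")
```

Under these assumptions the transfer cost is identical in both cases; the difference comes entirely from the extra seeks.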

Hardware basics
Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. Available disk space is several (2-3) orders of magnitude larger.
Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.

Hardware basics

4.2 Recall: inverted index

Earlier approach
Pass through the collection and assemble all term-docID pairs. Sort the pairs with the term as the dominant key and the docID as the secondary key. Finally, organize the docIDs for each term into a postings list and compute statistics such as term frequency and document frequency.
For small collections, all this can be done in memory. However, we will describe methods for large collections that require the use of secondary storage.
To make index construction more efficient, we represent terms as termIDs (instead of strings, as we did in Figure 1.4), where each termID is a unique serial number.
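The in-memory approach described above can be sketched as follows. This is a minimal illustration, not the slides' own code: the whitespace tokenizer, the function name `build_index`, and the three toy documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]):
    """Build an inverted index in memory from {docID: text}."""
    term_to_id: dict[str, int] = {}   # term string -> termID (serial number)
    pairs = []                        # all (termID, docID) pairs

    # Step 1: pass through the collection and assemble termID-docID pairs.
    for doc_id, text in docs.items():
        for token in text.lower().split():   # naive tokenizer (assumption)
            term_id = term_to_id.setdefault(token, len(term_to_id))
            pairs.append((term_id, doc_id))

    # Step 2: sort with termID as the dominant key, docID as the secondary key.
    pairs.sort()

    # Step 3: group docIDs into postings lists, skipping duplicate docIDs.
    postings = defaultdict(list)
    for term_id, doc_id in pairs:
        plist = postings[term_id]
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)

    # Document frequency = number of documents containing the term.
    doc_freq = {t: len(p) for t, p in postings.items()}
    return term_to_id, dict(postings), doc_freq

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in july"}
term_to_id, postings, doc_freq = build_index(docs)
print(postings[term_to_id["home"]])
```

Everything here lives in Python lists and dictionaries, which is exactly what stops working once the pairs no longer fit in main memory.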

Reuters-RCV1 collection
The corpus we'll use isn't really large enough, but it's publicly available and is at least a more plausible example. As an example for applying index construction algorithms, we will use the Reuters-RCV1 collection (approx. 1 GB). This is one year of Reuters newswire (part of 1996 and 1997).

A Reuters-RCV1 document

Reuters RCV1 statistics

Issue in indexing
Reuters-RCV1 has 100 million tokens. Collecting all termID-docID pairs of the collection, using 4 bytes each for termID and docID, therefore requires 0.8 GB of storage.
Typical collections today are often one or two orders of magnitude larger than Reuters-RCV1. You can easily see how such collections overwhelm (bury) even large computers if we try to sort their termID-docID pairs in memory.
If the size of the intermediate files during index construction is within a small factor of available memory, then the compression techniques introduced in Chapter 5 can help; however, the postings file of many large collections cannot fit into memory even after compression.
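The 0.8 GB figure follows directly from the numbers on this slide. The short check below uses decimal gigabytes (10⁹ bytes), which is an assumption about the unit convention the slide intends.

```python
TOKENS = 100_000_000   # tokens in Reuters-RCV1, one termID-docID pair each
BYTES_PER_ID = 4       # 4 bytes for the termID, 4 bytes for the docID

pair_bytes = 2 * BYTES_PER_ID        # one termID plus one docID
total_bytes = TOKENS * pair_bytes    # storage for all pairs
total_gb = total_bytes / 10**9       # decimal gigabytes (assumption)

print(f"{total_bytes:,} bytes = {total_gb} GB")
```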

Issue in indexing
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting; sequential disk reads are far faster than seeks, as we explained in Section 4.1.
One solution is the blocked sort-based indexing algorithm, or BSBI, in Figure 4.2.

BSBI Algorithm

BSBI algorithm
The algorithm parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in Figure 4.2). We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk.
Inversion involves two steps. First, we sort the termID-docID pairs. Next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk.
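The two inversion steps can be sketched in a few lines. This is an illustrative sketch only: the function names `invert_block` and `write_block`, the tab-separated text format of the block file, and the sample pairs are assumptions, not part of Figure 4.2.

```python
import os
import tempfile

def invert_block(pairs):
    """Invert one block: sort the (termID, docID) pairs, then collect the
    docIDs of each termID into a postings list (duplicates are dropped)."""
    block_index = []   # list of (termID, [docIDs]) in termID order
    for term_id, doc_id in sorted(pairs):            # step 1: sort
        if block_index and block_index[-1][0] == term_id:
            if block_index[-1][1][-1] != doc_id:     # step 2: group
                block_index[-1][1].append(doc_id)
        else:
            block_index.append((term_id, [doc_id]))
    return block_index

def write_block(block_index, path):
    """Write one postings list per line: termID, a tab, then the docIDs."""
    with open(path, "w") as f:
        for term_id, doc_ids in block_index:
            f.write(f"{term_id}\t{' '.join(map(str, doc_ids))}\n")

# Hypothetical block of pairs, as a parser might accumulate them.
pairs = [(2, 1), (1, 1), (2, 3), (1, 2), (2, 1)]
block = invert_block(pairs)
path = os.path.join(tempfile.mkdtemp(), "block0.txt")
write_block(block, path)
print(block)
```

Because each block file is written in termID order, the later merge step can read every block sequentially.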

BSBI algorithm
Applying this to Reuters-RCV1 and assuming we can fit 10 million termID-docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.
In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 4.3. To do the merging, we open all block files simultaneously and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged index we are writing.
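The merge step can be sketched with a lazy k-way merge. As a simplifying assumption, the blocks below are in-memory lists of (termID, postings) tuples sorted by termID; in a real run each would be an iterator reading a block file through a small buffer. The function name `merge_blocks` and the sample blocks are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_blocks(blocks):
    """Merge several block indexes, each sorted by termID, into one index.

    heapq.merge consumes its inputs lazily, so only the current entry of
    each block needs to be held at a time, much like small read buffers.
    """
    merged_stream = heapq.merge(*blocks, key=itemgetter(0))
    for term_id, group in groupby(merged_stream, key=itemgetter(0)):
        # Combine the postings of this termID from every block it occurs in.
        doc_ids = sorted({d for _, postings in group for d in postings})
        yield term_id, doc_ids

# Two hypothetical block indexes, each already sorted by termID.
block_a = [(1, [1, 2]), (3, [2])]
block_b = [(1, [5]), (2, [4, 6]), (3, [4])]

merged = list(merge_blocks([block_a, block_b]))
print(merged)
```

The generator yields one merged postings list at a time, which corresponds to appending to the write buffer of the final index.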

BSBI Algorithm

BSBI algorithm: complexity
How expensive is BSBI? Its time complexity is Θ(T log T) because the step with the highest time complexity is sorting, and T is an upper bound for the number of items we must sort (i.e., the number of termID-docID pairs).

Class exercise
Exercise 4.1: If we need T log₂ T comparisons (where T is the number of termID-docID pairs) and two disk seeks for each comparison, how much time would index construction for Reuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the system parameters in Table 4.1.

Solution
Disk seek time = 5 × 10⁻³ s, so two seeks cost 2 × (5 × 10⁻³) s per comparison.
Transfer time = 2 × 10⁻⁸ s per byte.
Low-level operations = 10⁻⁸ s each.
How long would it take to make T log₂ T comparisons with 2 disk seeks per comparison? T log₂ T × 2 × (5 × 10⁻³ s) ... also consider transfer time and any low-level operations.
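The dominant seek term of this estimate can be evaluated numerically. The sketch below assumes T = 100 million (the token count given for Reuters-RCV1 earlier in these slides) and deliberately ignores transfer time and low-level operations, so it is a lower bound rather than a complete answer.

```python
import math

T = 100_000_000            # termID-docID pairs in Reuters-RCV1 (assumed = tokens)
SEEK_TIME_S = 5e-3         # disk seek time in seconds
SEEKS_PER_COMPARISON = 2

comparisons = T * math.log2(T)                       # T log2 T comparisons
seek_seconds = comparisons * SEEKS_PER_COMPARISON * SEEK_TIME_S
seek_days = seek_seconds / 86_400                    # seconds per day

print(f"{comparisons:.3g} comparisons, {seek_seconds:.3g} s, {seek_days:.0f} days")
```

Under these assumptions the seeks alone take on the order of 10⁷ seconds, roughly 300 days, which is why an external sorting algorithm that avoids random seeks is needed.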

Class exercise
Exercise 4.2 [ ]: How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?
Solution: Suppose you skipped the initial step of sorting the termIDs and docIDs and instead created a postings list on the fly whenever you encountered a new termID, then added a new posting to that list for each further occurrence of the termID. Would you avoid an extra pass through the data, and would it still be blocked sort-based indexing?