Reuters collection example (approximate # s)

Similar documents
Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson

PV211: Introduction to Information Retrieval

Information Retrieval

Information Retrieval and Organisation

Introduction to Information Retrieval

Introduction to Information Retrieval

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4)

Introduction to Information Retrieval

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index Construction 1

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

Information Retrieval

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Document Representation : Quiz

Index construc-on. Friday, 8 April 16 1

Information Retrieval

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

CSCI 5417 Information Retrieval Systems Jim Martin!

Introduction to Information Retrieval

Information Retrieval. Danushka Bollegala

Building an Inverted Index

Index Construction. Slides by Manning, Raghavan, Schutze

Introduc)on to. CS60092: Informa0on Retrieval

Information Retrieval

Introduction to Information Retrieval

INDEX CONSTRUCTION 1

Index construc-on. Friday, 8 April 16 1

Lecture 3 Index Construction and Compression. Many thanks to Prabhakar Raghavan for sharing most content from the following slides

Informa(on Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Information Retrieval

Information Retrieval. Lecture 2 - Building an index

Information Retrieval

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Efficiency vs. Effectiveness in Terabyte-Scale IR

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

CS105 Introduction to Information Retrieval

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

COSEQUENTIAL PROCESSING (SORTING LARGE FILES)

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Organização e Recuperação da Informação

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

CS60092: Informa0on Retrieval

Logging File Systems

CS160 - Assignment 2 Due: Friday Sept. 25, 6pm

1 o Semestre 2007/2008

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Query Answering Using Inverted Indexes

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Information Retrieval

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

COMP6237 Data Mining Searching and Ranking

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Lecture 5: Information Retrieval using the Vector Space Model

DB-retrieval: I m sorry, I can only look up your order, if you give me your OrderId.

Appendix A1: Fig A1-1. Examples of build methods for distributed inverted files

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Complexity Analysis of an Algorithm

Information Retrieval

Document indexing, similarities and retrieval in large scale text collections

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Corso di Biblioteche Digitali

Information Retrieval

Query Evaluation Strategies

Advanced Retrieval Information Analysis Boolean Retrieval

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Query Evaluation Strategies

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Sorting atomic items. Chapter 5

Text Retrieval an introduction

COMP Parallel Computing. Lecture 22 November 29, Datacenters and Large Scale Data Processing

The Berkeley File System. The Original File System. Background. Why is the bandwidth low?

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval

Data-Intensive Distributed Computing

Introduction to Information Retrieval

Information Retrieval II

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

Information Retrieval

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

THE WEB SEARCH ENGINE

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

230 Million Tweets per day

Midterm spring. CSC228H University of Toronto

Transcription:

BSBI Reuters collection example (approximate # s) 800,000 documents from the Reuters news feed 200 terms per document 400,000 unique terms number of postings 100,000,000

BSBI Reuters collection example (approximate # s) Sorting 100,000,000 records on disk is too slow because of disk seek time. Parse and build posting entries one at a time Sort posting entries by term Then by document in each term Doing this with random disk seeks is too slow e.g. If every comparison takes 2 disk seeks and N items need to be sorted with N log2(n) comparisons? 306ish days?

BSBI Reuters collection example (approximate # s) 100,000,000 records Nlog2(N) is = 2,657,542,475.91 comparisons 2 disk seeks per comparison = 13,287,712.38 seconds x 2 = 26,575,424.76 seconds = 442,923.75 minutes = 7,382.06 hours = 307.59 days = 84% of a year = 1% of your life

BSBI - Block sort-based indexing Different way to sort index 12-byte records (term, doc, meta-data) Need to sort T= 100,000,000 such 12-byte records by term Define a block to have 1,600,000 such records can easily fit a couple blocks in memory we will be working with 64 such blocks Accumulate postings for each block (real blocks are bigger) Sort each block Write to disk Then merge

BSBI - Block sort-based indexing Different way to sort index Block Block Merged Postings (1998,www.cnn.com) (Every,www.cnn.com) (Her,news.google.com) (I'm,news.bbc.co.uk) (1998,news.google.com) (Her,news.bbc.co.uk) (I,www.cnn.com) (Jensen's,www.cnn.com) (1998,www.cnn.com) (1998,news.google.com) (Every,www.cnn.com) (Her,news.bbc.co.uk) (Her,news.google.com) (I,www.cnn.com) (I'm,news.bbc.co.uk) (Jensen's,www.cnn.com) Disk

BSBI - Block sort-based indexing BlockSortBasedIndexConstruction BlockSortBasedIndexConstruction() 1 n 0 2 while (all documents not processed) 3 do block ParseNextBlock() 4 BSBI-Invert(block) 5 WriteBlockToDisk(block, f n ) 6 MergeBlocks(f 1, f 2..., f n, f merged )

BSBI - Block sort-based indexing Block merge indexing Parse documents into (TermID, DocID) pairs until block is full Invert the block Sort the (TermID,DocID) pairs Compile into TermID posting lists Write the block to disk Then merge all blocks into one large postings file Need 2 copies of the data on disk (input then output)

BSBI - Block sort-based indexing Analysis of BSBI The dominant term is O(TlogT) T is the number of TermID,DocID pairs But in practice ParseNextBlock takes the most time Then MergingBlocks Again, disk seeks times versus memory access times

BSBI - Block sort-based indexing Analysis of BSBI 12-byte records (term, doc, meta-data) Need to sort T= 100,000,000 such 12-byte records by term Define a block to have 1,600,000 such records can easily fit a couple blocks in memory we will be working with 64 such blocks 64 blocks * 1,600,000 records * 12 bytes = 1,228,800,000 bytes Nlog2N comparisons is 5,584,577,250.93 2 touches per comparison at memory speeds (10e-6 sec) = 55,845.77 seconds = 930.76 min = 15.5 hours

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

Single-Pass In-Memory Indexing SPIMI BSBI is good but, it needs a data structure for mapping terms to termids this won t fit in memory for big corpora Straightforward solution dynamically create dictionaries store the dictionaries with the blocks

Single-Pass In-Memory Indexing SPIMI BSBI is good but, it needs a data structure for mapping terms to termids this won t fit in memory for big corpora Straightforward solution dynamically create dictionaries store the dictionaries with the blocks

Single-Pass In-Memory Indexing SPIMI-Invert(tokenStream) 1 outputf ile NewFile() 2 dictionary NewHash() 3 while (f ree memory available) 4 do token next(tokenstream) 5 if term(token) / dictionary 6 then postingslist AddToDictionary(dictionary, term(token)) 7 else postingslist GetPostingsList(dictionary, term(token)) 8 if f ull(postingslist) 9 then postingslist DoublePostingsList(dictionary, term(token)) 10 AddToPostingsList(postingsList, docid(token)) 11 sortedt erms SortTerms(dictionary) 12 WriteBlockToDisk(sortedT erms, dictionary, outputf ile) 13 return outputf ile

Single-Pass In-Memory Indexing So what is different here? SPIMI adds postings directly to a posting list. BSBI first collected (TermID,DocID pairs) then sorted them then aggregated the postings Each posting list is dynamic so there is no posting list sorting Saves memory because a term is only stored once Complexity is more like O(T) Compression enables bigger effective blocks

Single-Pass In-Memory Indexing Large Scale Indexing Key decision in block merge indexing is block size In practice, spidering often interlaced with indexing Spidering bottlenecked by WAN speed and other factors