Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Similar documents
Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson

Reuters collection example (approximate # s)

Information Retrieval and Organisation

PV211: Introduction to Information Retrieval

Information Retrieval

Introduction to Information Retrieval

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Introduction to Information Retrieval

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index Construction 1

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4)

Information Retrieval

Introduction to Information Retrieval

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

CSCI 5417 Information Retrieval Systems Jim Martin!

Introduction to Information Retrieval

Information Retrieval

Information Retrieval. Danushka Bollegala

Introduc)on to. CS60092: Informa0on Retrieval

INDEX CONSTRUCTION 1

Information Retrieval

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to Information Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Index Construction. Slides by Manning, Raghavan, Schutze

Index construc-on. Friday, 8 April 16 1

Document Representation : Quiz

Building an Inverted Index

Information Retrieval

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Lecture 3 Index Construction and Compression. Many thanks to Prabhakar Raghavan for sharing most content from the following slides

Informa(on Retrieval

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Index construc-on. Friday, 8 April 16 1

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Developing MapReduce Programs

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Information Retrieval

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Chapter 2. Architecture of a Search Engine

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007

Data-Intensive Distributed Computing

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Efficiency vs. Effectiveness in Terabyte-Scale IR

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Information Retrieval

CS105 Introduction to Information Retrieval

Information Retrieval. Lecture 2 - Building an index

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Full-Text Indexing For Heritrix

Information Retrieval

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

Information Retrieval II

Column Stores vs. Row Stores How Different Are They Really?

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

CS November 2018

The Right Read Optimization is Actually Write Optimization. Leif Walsh

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

In-Memory Data Management Jens Krueger

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

CS November 2017

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Information Retrieval

THE WEB SEARCH ENGINE

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Storage Systems. Storage Systems

COMP6237 Data Mining Searching and Ranking

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Midterm 2, CS186 Prof. Hellerstein, Spring 2012

Information Retrieval

Introduction to Information Retrieval

CS60092: Informa0on Retrieval

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Instructor: Stefan Savev

Mining Web Data. Lijun Zhang

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

CS252 S05. CMSC 411 Computer Systems Architecture Lecture 18 Storage Systems 2. I/O performance measures. I/O performance measures

Introduction to Information Retrieval

Information Retrieval Spring Web retrieval

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

MapReduce Algorithms

Search Engines. Information Retrieval in Practice

Map Reduce. Yerevan.

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Corso di Biblioteche Digitali

Recap: lecture 2 CS276A Information Retrieval

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Transcription:

Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

Indices The index has a list of vector space models 1 1998 1 Every 1 Her 1 I 1 I'm 1 Jensen's 2 Julie 1 Letter 1 Most 1 all 1 allegedly 1 back 1 before 1 brings 2 brothers 1 could 1 days 1 dead 1 death 1 everything 1 for 1 from 1 full 1 happens 1 haunts 1 have 1 hear 3 her 1 husband 1 if 1 it 1 killing 1 letter 1 nothing 1 now 1 of 1 pray 1 read, 1 saved 1 sister 1 stands 1 story 1 the 2 they 1 time 1 trial 1 wonder 1 wrong 1 wrote 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1

Indices Term-Document Matrix Capture Keywords A Column for Each Web Page (or Document ) A Row For Each Word (or Term ) 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1... 0 0 0 1 1 4 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 This picture is deceptive it is really very sparse Our queries are terms - not documents We need to invert the vector space model To make postings

Introduction Terms Inverted index (Term, Document) pairs building blocks for working with Term-Document Matrices Index construction (or indexing) The process of building an inverted index from a corpus Indexer The system architecture and algorithm that constructs the index

Indices The index is built from term-document pairs (TERM,DOCUMENT) (1998,www.cnn.com) (Every,www.cnn.com) (Her,www.cnn.com) (I,www.cnn.com) (I'm,www.cnn.com) (Jensen's,www.cnn.com) (Julie,www.cnn.com) (Letter,www.cnn.com) (Most,www.cnn.com) (all,www.cnn.com) (allegedly,www.cnn.com) (back,www.cnn.com) (before,www.cnn.com) (brings,www.cnn.com) (brothers,www.cnn.com) (could,www.cnn.com) (days,www.cnn.com) (dead,www.cnn.com) (death,www.cnn.com) (everything,www.cnn.com) (for,www.cnn.com) (from,www.cnn.com) (full,www.cnn.com) (happens,www.cnn.com) (haunts,www.cnn.com) (have,www.cnn.com) (hear,www.cnn.com) (her,www.cnn.com) (husband,www.cnn.com) (if,www.cnn.com) (it,www.cnn.com) (killing,www.cnn.com) (letter,www.cnn.com) (nothing,www.cnn.com) (now,www.cnn.com) (of,www.cnn.com) (pray,www.cnn.com) (read,,www.cnn.com) (saved,www.cnn.com) (sister,www.cnn.com) (stands,www.cnn.com) (story,www.cnn.com) (the,www.cnn.com) (they,www.cnn.com) (time,www.cnn.com) (trial,www.cnn.com) (wonder,www.cnn.com) (wrong,www.cnn.com) (wrote,www.cnn.com)

Indices The index is built from term-document pairs (TERM,DOCUMENT) (1998,www.cnn.com) (Every,www.cnn.com) (Her,www.cnn.com) (I,www.cnn.com) (I'm,www.cnn.com) (Jensen's,www.cnn.com) (Julie,www.cnn.com) (Letter,www.cnn.com) (Most,www.cnn.com) (all,www.cnn.com) (allegedly,www.cnn.com) (back,www.cnn.com) (before,www.cnn.com) (brings,www.cnn.com) (brothers,www.cnn.com) (could,www.cnn.com) (days,www.cnn.com) (dead,www.cnn.com) (death,www.cnn.com) (everything,www.cnn.com) (for,www.cnn.com) (from,www.cnn.com) (full,www.cnn.com) (happens,www.cnn.com) (haunts,www.cnn.com) (have,www.cnn.com) (hear,www.cnn.com) (her,www.cnn.com) (husband,www.cnn.com) (if,www.cnn.com) (it,www.cnn.com) (killing,www.cnn.com) (letter,www.cnn.com) (nothing,www.cnn.com) (now,www.cnn.com) (of,www.cnn.com) (pray,www.cnn.com) (read,,www.cnn.com) (saved,www.cnn.com) (sister,www.cnn.com) (stands,www.cnn.com) (story,www.cnn.com) (the,www.cnn.com) (they,www.cnn.com) (time,www.cnn.com) (trial,www.cnn.com) (wonder,www.cnn.com) (wrong,www.cnn.com) (wrote,www.cnn.com) Core indexing step is to sort by terms

Indices Term-document pairs make lists of postings (TERM,DOCUMENT, DOCUMENT, DOCUMENT,...) (1998,www.cnn.com,news.google.com,news.bbc.co.uk) (Every,www.cnn.com, news.bbc.co.uk) (Her,www.cnn.com,news.google.com) (I,www.cnn.com,www.weather.com, ) (I'm,www.cnn.com,www.wallstreetjournal.com) (Jensen's,www.cnn.com) (Julie,www.cnn.com) (Letter,www.cnn.com) (Most,www.cnn.com) (all,www.cnn.com) (allegedly,www.cnn.com) A posting is a list of all documents in which a term occurs. This is inverted from how documents naturally occur

Introduction Terms How do we construct an index?

Introduction Interactions An indexer needs raw text We need crawlers to get the documents We need APIs to get the documents from data stores We need parsers (HTML, PDF, PowerPoint, etc.) to convert the documents Indexing the web means this has to be done web-scale

Introduction Construction Index construction in main memory is simple and fast. But: As we build the index we parse docs one at a time Final postings for a term are incomplete until the end. At 10-12 postings per term, large collections demand a lot of space Intermediate results must be stored on disk

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

Hardware in 2007 (hasn t changed much through 2014) System Parameters Disk seek time = 0.005 sec (2014: 0.004 hp - 0.015 mobile) Transfer time per byte = 0.00000002 sec Processor clock rate = 0.00000001 sec Size of main memory = several GB Size of disk space = several TB http://en.wikipedia.org/wiki/hard_disk_drive_performance_characteristics

Hardware in 2007 (hasn t changed much through 2014) System Parameters Data is transferred from disk in blocks Operating Systems read data in blocks, so Reading one byte and reading one block take the same amount of time

Hardware in 2007 (hasn t changed much through 2014) System Parameters Disk Seek Time The amount of time to get the disk head to the data About 10 times slower than memory access We must utilize caching No data is transferred during seek Data is transferred from disk in blocks There is no additional overhead to read in an entire block 0.2098 seconds to get 10 MB if it is one block 0.7048 seconds to get 10 MB if it is stored in 100 blocks

System Parameters

System Parameters

Hardware in 2007 System Parameters Data transfers are done on the system bus, not by the processor The processor is not used during disk I/O Assuming an efficient decompression algorithm The total time of reading and then decompressing compressed data is usually less than reading uncompressed data.

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

BSBI Reuters collection example (approximate # s) 800,000 documents from the Reuters news feed 200 terms per document 400,000 unique terms number of postings 100,000,000

BSBI Reuters collection example (approximate # s) Sorting 100,000,000 records on disk is too slow because of disk seek time. Parse and build posting entries one at a time Sort posting entries by term Then by document in each term Doing this with random disk seeks is too slow e.g. If every comparison takes 2 disk seeks and N items need to be sorted with N log2(n) comparisons? How long is that going to take?

BSBI Reuters collection example (approximate # s) 100,000,000 records Nlog2(N) is = 2,657,542,475.91 comparisons 2 disk seeks per comparison = 13,287,712.38 seconds x 2 = 26,575,424.76 seconds = 442,923.75 minutes = 7,382.06 hours = 307.59 days = 84% of a year = 1% of your life 2black.wordpress.com

Index Construction Review termid is an index given to a vocabulary word e.g., house = 57820 docid is an index given to a document e.g., news.bbc.co.uk = 74291 posting list is a data structure for the term-document matrix! Term DocID DocID DocID DocID posting list is an inverted data structure

BSBI - Block sort-based indexing Different way to sort index 12-byte records (term, doc, meta-data) Need to sort T= 100,000,000 such 12-byte records by term Define a block to have 1,600,000 such records can easily fit a couple blocks in memory we will be working with 64 such blocks Accumulate postings for each block (real blocks are bigger) Sort each block Write to disk

BSBI - Block sort-based indexing Different way to sort index Crawl System Disk Block that fits in memory (1998, www.cnn.com) (every, www.cnn.com) (I, www.cnn.com) (Jensen's, www.cnn.com) (kite, www.hobby.com) Block that fits in memory (1998, www.hobby.com) (her, news.bbc.co.uk) (I, news.bbc.co.uk) (lion, news.bbc.co.uk) (zebra, news.bbc.co.uk)

BSBI - Block sort-based indexing Different way to sort index Merged Postings (1998, www.cnn.com, www.hobby.com) (every, www.cnn.com) (her, news.bbc.co.uk) (I, www.cnn.com, news.bbc.co.uk (Jensen's, www.cnn.com) (kite, www.hobby.com) (lion, news.bbc.co.uk) (zebra, news.bbc.co.uk) (1998, www.cnn.com) (every, www.cnn.com) (I, www.cnn.com) (Jensen's, www.cnn.com) (kite, www.hobby.com) (1998, www.hobby.com) (her, news.bbc.co.uk) (I, news.bbc.co.uk) (lion, news.bbc.co.uk) (zebra, news.bbc.co.uk)......

BSBI - Block sort-based indexing BlockSortBasedIndexConstruction BlockSortBasedIndexConstruction() 1 n 0 2 while (all documents not processed) 3 do block ParseNextBlock() 4 BSBI-Invert(block) 5 WriteBlockToDisk(block, f n ) 6 MergeBlocks(f 1, f 2..., f n, f merged )

BSBI - Block sort-based indexing Block merge indexing Parse documents into (TermID, DocID) pairs until block is full Invert the block Sort the (TermID,DocID) pairs Compile into TermID posting lists Write the block to disk Then merge all blocks into one large postings file Need 2 copies of the data on disk (input then output)

BSBI - Block sort-based indexing Analysis of BSBI The dominant term is O(TlogT) T is the number of TermID,DocID pairs But in practice ParseNextBlock takes the most time Then MergingBlocks Again, disk seeks times versus memory access times

BSBI - Block sort-based indexing Analysis of BSBI 12-byte records (term, doc, meta-data) Need to sort T= 100,000,000 such 12-byte records by term Define a block to have 1,600,000 such records can easily fit a couple blocks in memory we will be working with 64 such blocks 64 blocks * 1,600,000 records * 12 bytes = 1,228,800,000 bytes Nlog2N comparisons is 5,584,577,250.93 2 touches per comparison at memory speeds (10e-6 sec) = 55,845.77 seconds = 930.76 min = 15.5 hours

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

Single-Pass In-Memory Indexing SPIMI BSBI is good but, it needs a data structure for mapping terms to termids this won t fit in memory for big corpora Straightforward solution dynamically create dictionaries store the dictionaries with the blocks integrate sorting and merging

Single-Pass In-Memory Indexing SPIMI-Invert(tokenStream) 1 outputf ile NewFile() 2 dictionary NewHash() 3 while (f ree memory available) 4 do token next(tokenstream) 5 if term(token) / dictionary 6 then postingslist AddToDictionary(dictionary, term(token)) 7 else postingslist GetPostingsList(dictionary, term(token)) 8 if f ull(postingslist) 9 then postingslist DoublePostingsList(dictionary, term(token)) 10 AddToPostingsList(postingsList, docid(token)) 11 sortedt erms SortTerms(dictionary) 12 WriteBlockToDisk(sortedT erms, dictionary, outputf ile) 13 return outputf ile

Single-Pass In-Memory Indexing So what is different here? SPIMI adds postings directly to a posting list. BSBI first collected (TermID,DocID pairs) then sorted them then aggregated the postings Each posting list is dynamic so there is no term sorting Saves memory because a term is only stored once Complexity is O(T) Compression enables bigger effective blocks

Single-Pass In-Memory Indexing Large Scale Indexing Key decision in block merge indexing is block size In practice, spidering often interlaced with indexing Spidering bottlenecked by WAN speed and other factors

Single-Pass In-Memory Indexing

Index Construction Overview Introduction Hardware BSBI - Block sort-based indexing SPIMI - Single Pass in-memory indexing Distributed indexing Dynamic indexing Miscellaneous topics

Index Construction Review termid is an index given to a vocabulary word e.g., house = 57820 docid is an index given to a document e.g., news.bbc.co.uk = 74291 posting list is a data structure for the term-document matrix! Term DocID DocID DocID DocID posting list is an inverted data structure

Index Construction Review BSBI and SPIMI are single pass indexing algorithms leverage fast memory vs slow disk speeds for data sets that won t fit in entirely in memory for data sets that will fit on a single disk

Index Construction Review BSBI builds (termid, docid) pairs until a block is filled builds a posting list in the final merge requires a vocabulary mapping word to termid SPMI builds posting lists until a block is filled combines posting lists in the final merge uses terms directly (not termids)

Index Construction What if your documents don t fit on a single disk? Web-scale indexing Use a distributed computing cluster supported by Cloud computing companies

End of Chapter 4