Building an Inverted Index

Similar documents
Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

1 o Semestre 2007/2008

Information Retrieval and Organisation

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Information Retrieval

Information Retrieval

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson

Information Retrieval

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

Introduc)on to. CS60092: Informa0on Retrieval

Index Construction 1

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construc-on. Friday, 8 April 16 1

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Reuters collection example (approximate # s)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Introduction. Secondary Storage. File concept. File attributes

Introduction to Information Retrieval

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

Information Retrieval

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

COMP6237 Data Mining Searching and Ranking

COSEQUENTIAL PROCESSING (SORTING LARGE FILES)

Information Retrieval

Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Midterm spring. CSC228H University of Toronto

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

CSCI 5417 Information Retrieval Systems Jim Martin!

Introduction to Information Retrieval

Chapter 18 Indexing Structures for Files

Index Construction. Slides by Manning, Raghavan, Schutze

CS4500/5500 Operating Systems File Systems and Implementations

Query Evaluation Strategies

Sorting Algorithms. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Introduction to Information Retrieval

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

EXTERNAL SORTING. Sorting

Introduction to Information Retrieval

Compressing Integers for Fast File Access

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Virtual Memory 2. q Handling bigger address spaces q Speeding translation

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Information Retrieval

Understanding the Optimizer

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

Document Representation : Quiz

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Information Retrieval

THE WEB SEARCH ENGINE

Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Indexing and Searching

Informa(on Retrieval

LINEAR INSERTION SORT

Indexing and Searching

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Secondary Storage (Chp. 5.4 disk hardware, Chp. 6 File Systems, Tanenbaum)

Advance Indexing. Limock July 3, 2014

Memory Management. COMP755 Advanced Operating Systems

SORTING. How? Binary Search gives log(n) performance.

Fundamentals of Database Systems

Introduction to Information Retrieval

CS 134: Operating Systems

On the Performance of MapReduce: A Stochastic Approach

COMP 530: Operating Systems File Systems: Fundamentals

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

Corso di Biblioteche Digitali

Information Retrieval. Chap 8. Inverted Files

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

The What, Why and How of the Pure Storage Enterprise Flash Array. Ethan L. Miller (and a cast of dozens at Pure Storage)

The Effectiveness of Deduplication on Virtual Machine Disk Images

We can use a max-heap to sort data.

Advanced Database Systems

File Systems. CS 4410 Operating Systems

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

File Systems. CS170 Fall 2018

Full-Text Indexing For Heritrix

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

INDEX CONSTRUCTION 1

The Right Read Optimization is Actually Write Optimization. Leif Walsh

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

EECS 482 Introduction to Operating Systems

CMSC424: Database Design. Instructor: Amol Deshpande

Virtual Memory 2. To do. q Handling bigger address spaces q Speeding translation

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

CSE Data Structures and Introduction to Algorithms... In Java! Instructor: Fei Wang. Mid-Term Exam. CSE2100 DS & Algorithms 1

Today: Secondary Storage! Typical Disk Parameters!

ECE331: Hardware Organization and Design

For searching and sorting algorithms, this is particularly dependent on the number of data elements.

Pharmacy college.. Assist.Prof. Dr. Abdullah A. Abdullah

File Systems: Fundamentals

Merge Sort Goodrich, Tamassia Merge Sort 1

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

File Systems: Fundamentals

Transcription:

Building an Inverted Index

Algorithms Memory-based Disk-based (Sort-Inversion) Sorting Merging (2-way; multi-way) 2

Memory-based Inverted Index Phase I (parse and read) For each document Identify distinct terms in the document Update, in memory the posting list for each term Phase II (write) For each distinct term in the index Write the inverted index to disk (feel free to compress the posting list while writing it) 3

Variables B = Text size (figure 5 GB to get started) N = Number of Documents (about 500,000) n = Distinct Terms (index size) F = Number of Words (about 800,000,000 f = Number of posting list entries (about 400,000,000 I = size of compressed file (about 400MB) L = size of index = 3 MB Disk seek time = t s =.01 seconds Disk Transfer time = t r =.000005 seconds Time to compare and swap 10 byte records = t c. = 000001 seconds Time to parse, stem, and look up one term = t p =.000020 seconds M = Main memory available = (figure 40 MB, a low number) 4

Memory Only Analysis Time to read and parse R = B t r + F t p Time to write W = I(t d + t r ) Took about six hours to index a 5GB collection Fine, except for memory requirements. 5

Memory Requirements While in memory the posting list is not compressed. Typical entry DocID tf nextpointer (4 bytes) (2 bytes) (4 bytes) For an 800,000,000 word collection, 400,000,000 posting list entries were needed (many terms do not result in a posting list entry because of stop words and duplicate occurences of a term within a document). With 400,000,000 posting list entries, at 10 bytes per entry, we obtain a memory requirement of 4GB. 6

Summary Pro Very fast algorithm that is easy to implement. Con Doesn t work at all if you run out of main memory. Once memory runs out you start swapping. For small document collections this is a great algorithm, for anything realistic it requires a lot of memory. 7

Simple Alternatives that do not Work at All Simply storing the posting lists on disk would require a tremendous amount of I/O. One estimate shows SIX WEEKS to run this for 5 GB collection. Alternatively, we could let the OS take care of the memory and just use virtual memory to solve the problem. This results in significant swapping (13 days). 8

Disk-based Phase I Create temp files of triples (termid, docid, tf) Phase II Sort the triples using external mergesort Phase III Merge the sorted triples files (2-way; m-way) Phase IV Build Inverted index from sorted triples 9

Disk-based (Cont d) Phase I (parse and build temp file) For each document Parse text into terms, assign a term to a termid (use an internal index for this) For each distinct term in the document Write an entry to a temporary file with only triples <termid, docid, tf) Phase II (make sorted runs, to prepare for merge) Do Until End of Temporary File Sort the triples in memory by term id and doc id. Write them out in a sorted run on disk. 10

Disk-based (Cont d) Run1: tid did tf 1 d1 2 3 d1 1 Sorted: tid did tf 1 d1 2 1 d2 1 5 d1 2 2 d1 4 2 d1 4 2 d2 3 4 d1 1 3 d1 1 1 d2 1 4 d1 1 2 d2 3 5 d1 2 5 d2 3 5 d2 3 Run2: tid did tf 1 d3 2 2 d3 1 Sorted: tid did tf 1 d3 2 1 d4 2 4 d3 3 2 d3 1 2 d4 2 2 d4 2 3 d4 1 3 d4 1 5 d4 2 4 d3 3 4 d4 1 4 d4 1 1 d4 2 5 d4 2 11

Disk-based (Cont d) Phase III (merge the runs) Repeat until there is only one run Phase IV Merge pair-wise (2-way) or m-way sorted runs into a single run. For each distinct term in final sorted run Start a new inverted file entry. Read all triples for a given term (these will be in sorted order) Build the posting list (feel free to use compression) Write (append) this entry to the inverted index into a binary file. 12

Sort-based (Cont d) Sorted Run1: Sorted Run2: tid did tf 1 d1 2 1 d2 1 2 d1 4 2 d2 3 3 d1 1 4 d1 1 5 d1 2 5 d2 3 tid did tf 1 d3 2 1 d4 2 2 d3 1 2 d4 2 3 d4 1 4 d3 3 4 d4 1 5 d4 2 Merged: tid did tf 1 d1 2 1 d2 1 1 d3 2 1 d4 2 2 d1 4 2 d2 3 2 d3 1 2 d4 2 3 d1 1 3 d4 1 4 d1 1 4 d3 3 4 d4 1 5 d1 2 5 d2 3 5 d4 2 13

Disk-based (Cont d) Final Sorted Run: tid did tf 1 d1 2 1 d2 1 1 d3 2 1 d4 2 t1 t2 d1,2 d2,1 d1,4 d2,3 d3,2 d3,1 d4,2 d4,2 2 d1 4 2 d2 3 2 d3 1 t3 d1,1 d4,1 Stream of Posting List Nodes 2 d4 2 3 d1 1 t4 d1,1 d3,3 d4,1 3 d4 1 4 d1 1 4 d3 3 t5 d1,2 d2,3 d4,2 4 d4 1 5 d1 2 5 d2 3 5 d4 2 Inverted Index 14

Time (estimates from testing) Read and Parse (5 hours) Write temp file (30 minutes) Sort (4 hours, 3 hours sort + 1 hour for r/w of temp file) With 40MB and 400,000,000 triples in the temporary file we can hold 4,000,000, 10 byte triples in memory. So we have 100 runs. Merge (7 hours for I/O, 2 hours computation) log100 = 7 passes, each pass is a full r/w of temp file compute time to do the merge (comparisons of runs) Read sorted temp file and build inverted index (1.5 hours) Total time (about 20 hours) 15

Analysis Time to read and parse, write file R = Bt r + F t p + 10 f t r Time to sort 20 f t r (read and write) + R(1.2k log k) t c (time to sort a run) k number of triples that fit into memory R number of runs to merge Time to merge (log R) (20 f t r + f t c ) (read, write, compare a run) Write the final inverted file 10 f t r + I(t d +t r ) 16

Disk Space Requirements Now that we have satisfied the low memory requirements, lets look at disk space. For a file of size T we need size 2T of disk space. Final merge will be writing a new temp file of size T and we will still be reading from a temp file of size T. 17

Summary Pro Not as fast as memory based, but at least is usable. Sort and merge time dominates cost. Con Requires twice the amount of disk space as the size of the original text. 18