1 o Semestre 2007/2008

Size: px
Start display at page:

Download "1 o Semestre 2007/2008"

Transcription

1 Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008

2 Outline

3 Outline

4 Text es An index is a mechanism to locate a given term in the document collection types: inverted files signature files bitmaps

5 Definitions A document collection is a set of separate documents Documents are composed of terms es provide efficient access to the documents given one or more terms es can have different granularities: set of documents document paragraph word

6 A Document Collection Doc. Text 1 That government is best which governs least 2 That government is best which governs not at all 3 The mass of men serve the state not as men, but as machines 4 Wooden men can be manufactured that will serve the purpose as well 5 Government is at best but an expedient 6 But most governments are usually inexpedient

7 An Inverted File Lexicon Num. Term 1 best 2 expedient 3 government 4 governs 5 inexpedient 6 least 7 machines 8 manufactured 9 mass 10 men 11 purpose 12 serve 13 state 14 wooden Inverted file Num. Inverted list 1 3;1,2,5 2 1; 5 3 4;1,2,5,6 4 2;1,2 5 1; 6 6 1; 1 7 1; 3 8 1; 4 9 1; ;3,4 11 1; ;3,4 13 1; ; 4

8 Inverted file with word granularity Inverted File Num. Inverted list 1 3;(1;4),(2;4),(5;4) 2 1;(5;7) 3 4;(1;2),(2;2),(5;1),(6;3) 4 2;(1;6),(2;6) 5 1;(6;6) 6 1;(1;7) 7 1;(3;13) 8 1;(4;5) 9 1;(3;2) 10 3;(3;4),(3;10),(4;2) 11 1;(4;10) 12 2;(3;5),(4;8) 13 1;(3;7) 14 1;(4;1)

9 Query Processing Query processing involves searching the index and manipulating the inverted lists

10 Query Processing Query processing involves searching the index and manipulating the inverted lists For instance:

11 Query Processing Query processing involves searching the index and manipulating the inverted lists For instance: Query government and men intersection between inverted lists 3 and 10 (= );

12 Query Processing Query processing involves searching the index and manipulating the inverted lists For instance: Query government and men intersection between inverted lists 3 and 10 (= ); Query men or machines union between inverted lists 10 and 7 (= {3,4});

13 Query Processing Query processing involves searching the index and manipulating the inverted lists For instance: Query government and men intersection between inverted lists 3 and 10 (= ); Query men or machines union between inverted lists 10 and 7 (= {3,4}); Query men and not machines inverted list 10 except inverted list 7 (= {4});

14 Outline

15 An example collection Text size B bytes N. of Documents N N. of distinct terms n Total n. of terms F N. of index pointers f Size of compressed inverted file I bytes Size of lexicon L bytes Disk seek time t s sec Disk transfer time per byte t r sec Coding time per byte t d sec Time to compare and swap 10 byte records t c 10 6 sec Time to parse, stem, look up one term t p sec Amount of main memory available M bytes

16 A naive solution 1 Create a document term frequency matrix doc term Write matrix to disk one column at a time

17 A naive solution 1 Create a document term frequency matrix doc term Write matrix to disk one column at a time Memory requirements (using 4 bytes/entry): = 18Tbytes

18 Outline

19 Using a dictionary structure best expedient 4 2 government governs

20 Using a dictionary structure best expedient 4 2 government governs Time requirements T = Bt r + Ft p + I(t d + t r ) parse text write file 6h B = text size F = n. terms t r = transfer rate t p = parse time I = file size t d = coding time

21 Using a dictionary structure best expedient 4 2 government governs Time requirements T = Bt r + Ft p + I(t d + t r ) parse text write file 6h Space requirements S = ( ) pointer size n. of pointers 4Gb

22 Outline

23 d If virtual memory was used, the dictionary-based algorithm would take about 6 weeks to index the 5 Gbyte collection Mainly due to disk seek time Solution: allow only sequential disk access

24 d Algorithm (I) 1 Create an empty dictionary structure S; Create an empty temporary file T on disk; 2 For each document D d (1 d N): 1 Read and parse D d ; 2 For each triple t,d,f d,t 1 Store t in S; 2 Store t, d, f d,t in T; S that 1 government 2 best 3 governs 4 least 5... T 1, 1, 1 2, 1, 1 3, 1, 1 4, 1, 1 5, 1, 1 2, 2, 1 3, 2, 1 4, 2, 1 5, 3, 1 6, 3, 2...

25 d Algorithm (II) 3 Le k be the number of records that fit in memory 1 Read k records from T; 2 Sort according to t and d; 3 Write the sorted run into T; 4 Repeat until all runs are sorted; 4 Pairwise merge the sorted runs; T 1,1,1 2,1,1 3,1,1 4,1,1 5,1,1 2,2,1 3,2,1 4,2,1 5,3,1 6,3,2... T 1,1,1 2,1,1 3,1,1 2,2,1 4,1,1 5,1,1 3,2,1 4,2,1 5,3,1 6,3,2...

26 d Algorithm (II) 3 Le k be the number of records that fit in memory 1 Read k records from T; 2 Sort according to t and d; 3 Write the sorted run into T; 4 Repeat until all runs are sorted; 4 Pairwise merge the sorted runs; T 1,1,1 2,1,1 3,1,1 2,2,1 4,1,1 5,1,1 3,2,1 4,2,1 5,3,1 6,3,2... T 1,1,1 2,1,1 2,2,1 3,1,1 3,2,1 4,1,1 4,2,1 5,1,1 5,3,1 6,3,2...

27 d Algorithm (III) 5 For each term t (1 t n): 1 Start a new inverted file entry; 2 Read all triples t,d,f d,t from T and form the inverted list of t; 3 Compress the inverted list; 4 Append the list to the inverted file. T 1,1,1 2,1,1 2,2,1 3,1,1 3,2,1 4,1,1 4,2,1 5,1,1 5,3,1 6,3,2... t Inverted list 1 1;(1;1) 2 2;(1;1),(2;1) 3 2;(1;1),(2;1) 4 2;(1;1),(2;1) 5 2;(1;1),(3;1) 6 1;(3;2)...

28 Requirements Time requirements T = Bt r + Ft p + 10ft r read,parse,write +20ft r + R(1.2k log k)t c sort runs + log R (20ft r + ft c ) merge runs +10ft r + I(t d + t r ) write inverted file 20h B = text size t r = transfer rate F = total n. terms t p = parse time f = n. index pointers R = n. runs k = size of run t c = compare+swap I = file size t d = coding time

29 Requirements Time requirements T = Bt r + Ft p + 10ft r read,parse,write +20ft r + R(1.2k log k)t c sort runs + log R (20ft r + ft c ) merge runs +10ft r + I(t d + t r ) write inverted file 20h Space requirements S = temp. file size plus one copy 8Gb

30 Compression of the Temporary File The d, f d,t pairs can be compressed using methods appropriate for small integers E.g., Elias-δ and unary coding How to code t?

31 Using t-gaps Term numbers can be coded as sequences of t-gaps within each sorted run... 2,2,1 4,1,1 5,1, ,2,1 3,1,1 2,1,1... A t-gap x is encoded as value x + 1, using unary coding Each run requires a maximum of k + n bits

32 Compression Requirements Time requirements Runs must be sorted before writing to disk Must share space with dictionary shorter runs more runs to merge Requirements: T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,write log R (2I (t r + t d ) + ft c ) merge runs (I + I)(t r + t d ) recompress 26h B = text size t r = transfer rate F = total n. terms t p = parse time R = n. runs k = size of run t c = compare+swap I = temp. file size t d = coding time f = n. index pointers I = file size

33 Compression Requirements Time requirements Runs must be sorted before writing to disk Must share space with dictionary shorter runs more runs to merge Requirements: T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,write log R (2I (t r + t d ) + ft c ) merge runs (I + I)(t r + t d ) recompress 26h Space requirements S = (1.1Mb Mb) d,f d,t pairs k+n bits runs for t-gaps (4 more) 1Gb

34 Outline

35 The merging process can be improved Instead of a 2-way merge, use an R-way merge One pass pointers coded/decoded only once Increased seek time (+1m, approximately)

36 Time Requirements T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,write f log R t c + I (t s /b + t r + t d ) merge runs I(t r + t d ) recompress 11h B = text size F = total n. terms R = n. runs t c = compare+swap t d = coding time t s = seek time I = file size t r = transfer rate t p = parse time k = size of run I = temp. file size f = n. index pointers b M/R = input buffer

37 Time Requirements T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,write f log R t c + I (t s /b + t r + t d ) merge runs I(t r + t d ) recompress 11h Space requirements One copy of the temporary file: 540Mb

38 Outline

39 Assume that runs are split into blocks of b bytes (with added padding, if necessary) R blocks can be read into memory and overwritten by output blocks of b bytes The output blocks contain the inverted lists During the process all values can be recoded using the most efficient code If a block overruns the vacant space, it can be appended to the end of the file At the end of the merge, blocks must be sorted Block-sorting can be done in linear time, using two b-byte buffers

40 heap output block 3 run 1, block 1 run 2, block 2 output block 1 1 run 1, block 2 output block 2 2 run 2, block 3 run 3, block 1 blocks in memory run 3, block 2 temporary file block table

41 Block-sorting algorithm For i 1 to nblocks, if i blocktable[i] then: 1 Read block i into memory 2 Set holding blocktable[i] 3 Set vacant i 4 While holding vacant do 1 Find j such that blocktable[j] = vacant 2 Copy block j to vacant 3 Set blocktable[vacant] vacant 4 Set vacant j 5 Write block in memory to vacant 6 Set blocktable[vacant] holding

42 Algorithm (I) 1 Initialization: Create an empty dictionary structure S; Create an empty temporary file T on disk; Set L S Set k (M L)/w, where w is the number of bytes required to store one t, d, f d,t record Set b 50Kb Set R 0

43 Algorithm (II) 2 For each document D d, 1 d N: 1 Read and parse D d 2 For each term t D d 1 Search S for t 2 If t is not in S insert it set L S set k (M L)/w 3 Add a record t, d, f d,t to the array of triples 3 If, at any stage, the array of triples contains k items 1 Sort the array (using quicksort) 2 Write the array coding t-gaps in unary, d-gaps with δ and f d,t values in unary 3 Add padding to complete a block of b bytes 4 Set R R If (b (R + 1) > M, set b b/2

44 Algorithm (III) 3 : 1 Read the first block from each run and add each block number to the free list 2 Build heap with R candidates (one from each run) 3 While the heap is non-empty 1 Remove the root 2 Add it to the output block, recoding 3 Replace it by the next candidate from the same run 4 Each time the output block is full 1 Use the free list to find a vacant space; if there is none, append it to the end of the file 2 Write the output block 3 Update the free list and block table 5 Each time an input block is empty 1 Read the next block from this run 2 Update the free list 4 Reorder the blocks 5 Truncate the inverted file

45 Algorithm Requirements Time Requirements T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,writ f log R t c + (I + I)(t s /b + t r + t d ) merge,recode 2I (t s /b + t r ) permute B = text size t r = transfer rate 11h F = total n. terms t p = parse time R = n. runs t c = compare+swap t d = coding time t s = seek time I = file size k = size of run I = temp. file size f = n. index pointers b M/R = input buffer

46 Algorithm Requirements Time Requirements T = Bt r + Ft p read,parse +R(1.2k log k)t c + I (t r + t d ) sort,compress,writ f log R t c + (I + I)(t s /b + t r + t d ) merge,recode 2I (t s /b + t r ) permute 11h Space Requirements Inverted file and temporary file occupy the same space: 150 Mb

47 Outline

48 Method Memory (Mb) Disk (Mb) Time (h) Dictionary-based (mem.) Dictionary-based (disk) d d compressed merge multiway merge

49 Alternative based inversion methods Using minimal perfect hashing to eliminate the lexicon from memory Using in-memory compression Partitioning the inversion process Lexicon-based partition Text-based partition All these imply several passes through the text

50 Questions?

Building an Inverted Index

Building an Inverted Index Building an Inverted Index Algorithms Memory-based Disk-based (Sort-Inversion) Sorting Merging (2-way; multi-way) 2 Memory-based Inverted Index Phase I (parse and read) For each document Identify distinct

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Recap: lecture 2 CS276A Information Retrieval

Recap: lecture 2 CS276A Information Retrieval Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards This time: Spell correction Soundex Index construction Index

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Algorithm Analysis Advanced Data Structure. Chung-Ang University, Jaesung Lee

Algorithm Analysis Advanced Data Structure. Chung-Ang University, Jaesung Lee Algorithm Analysis Advanced Data Structure Chung-Ang University, Jaesung Lee Priority Queue, Heap and Heap Sort 2 Max Heap data structure 3 Representation of Heap Tree 4 Representation of Heap Tree 5 Representation

More information

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012 Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten

More information

Information Retrieval. Chap 8. Inverted Files

Information Retrieval. Chap 8. Inverted Files Information Retrieval Chap 8. Inverted Files Issues of Term-Document Matrix 500K x 1M matrix has half-a-trillion 0 s and 1 s Usually, no more than one billion 1 s Matrix is extremely sparse 2 Inverted

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

Information Retrieval 6. Index compression

Information Retrieval 6. Index compression Ghislain Fourny Information Retrieval 6. Index compression Picture copyright: donest /123RF Stock Photo What we have seen so far 2 Boolean retrieval lawyer AND Penang AND NOT silver query Input Set of

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

CS4500/5500 Operating Systems File Systems and Implementations

CS4500/5500 Operating Systems File Systems and Implementations Operating Systems File Systems and Implementations Yanyan Zhuang Department of Computer Science http://www.cs.uccs.edu/~yzhuang UC. Colorado Springs Recap of Previous Classes Processes and threads o Abstraction

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 17, March 24, 2015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part V External Sorting How to Start a Company in Five (maybe

More information

Some Practice Problems on Hardware, File Organization and Indexing

Some Practice Problems on Hardware, File Organization and Indexing Some Practice Problems on Hardware, File Organization and Indexing Multiple Choice State if the following statements are true or false. 1. On average, repeated random IO s are as efficient as repeated

More information

Outlook. File-System Interface Allocation-Methods Free Space Management

Outlook. File-System Interface Allocation-Methods Free Space Management File System Outlook File-System Interface Allocation-Methods Free Space Management 2 File System Interface File Concept File system is the most visible part of an OS Files storing related data Directory

More information

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25 Indexing Jan Chomicki University at Buffalo Jan Chomicki () Indexing 1 / 25 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow (nanosec) (10 nanosec) (millisec) (sec) Very small Small

More information

Amusing algorithms and data-structures that power Lucene and Elasticsearch. Adrien Grand

Amusing algorithms and data-structures that power Lucene and Elasticsearch. Adrien Grand Amusing algorithms and data-structures that power Lucene and Elasticsearch Adrien Grand Agenda conjunctions regexp queries numeric doc values compression cardinality aggregation How are conjunctions implemented?

More information

Advance Indexing. Limock July 3, 2014

Advance Indexing. Limock July 3, 2014 Advance Indexing Limock July 3, 2014 1 Papers 1) Gurajada, Sairam : "On-line index maintenance using horizontal partitioning." Proceedings of the 18th ACM conference on Information and knowledge management.

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

Divide and Conquer Algorithms

Divide and Conquer Algorithms Divide and Conquer Algorithms T. M. Murali February 19, 2009 Divide and Conquer Break up a problem into several parts. Solve each part recursively. Solve base cases by brute force. Efficiently combine

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking Chapter 17 Disk Storage, Basic File Structures, and Hashing Records Fixed and variable length records Records contain fields which have values of a particular type (e.g., amount, date, time, age) Fields

More information

Efficiency vs. Effectiveness in Terabyte-Scale IR

Efficiency vs. Effectiveness in Terabyte-Scale IR Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system

More information

Introduction. Secondary Storage. File concept. File attributes

Introduction. Secondary Storage. File concept. File attributes Introduction Secondary storage is the non-volatile repository for (both user and system) data and programs As (integral or separate) part of an operating system, the file system manages this information

More information

Storage hierarchy. Textbook: chapters 11, 12, and 13

Storage hierarchy. Textbook: chapters 11, 12, and 13 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow Very small Small Bigger Very big (KB) (MB) (GB) (TB) Built-in Expensive Cheap Dirt cheap Disks: data is stored on concentric circular

More information

File System Internals. Jo, Heeseung

File System Internals. Jo, Heeseung File System Internals Jo, Heeseung Today's Topics File system implementation File descriptor table, File table Virtual file system File system design issues Directory implementation: filename -> metadata

More information

IS 709/809: Computational Methods in IS Research. Algorithm Analysis (Sorting)

IS 709/809: Computational Methods in IS Research. Algorithm Analysis (Sorting) IS 709/809: Computational Methods in IS Research Algorithm Analysis (Sorting) Nirmalya Roy Department of Information Systems University of Maryland Baltimore County www.umbc.edu Sorting Problem Given an

More information

University of Waterloo Midterm Examination Sample Solution

University of Waterloo Midterm Examination Sample Solution 1. (4 total marks) University of Waterloo Midterm Examination Sample Solution Winter, 2012 Suppose that a relational database contains the following large relation: Track(ReleaseID, TrackNum, Title, Length,

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview

More information

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap to to Information Retrieval Index Construct Ruixuan Li Huazhong University of Science and Technology http://idc.hust.edu.cn/~rxli/ October, 2012 1 2 How to construct index? Computerese term document docid

More information

Programming II (CS300)

Programming II (CS300) 1 Programming II (CS300) Chapter 12: Sorting Algorithms MOUNA KACEM mouna@cs.wisc.edu Spring 2018 Outline 2 Last week Implementation of the three tree depth-traversal algorithms Implementation of the BinarySearchTree

More information

Query Answering Using Inverted Indexes

Query Answering Using Inverted Indexes Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

HICAMP Bitmap. A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14

HICAMP Bitmap. A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14 HICAMP Bitmap A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14 Database Indexing Databases use precomputed indexes

More information

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

CS60092: Informa0on Retrieval

CS60092: Informa0on Retrieval Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Last lecture index construc)on Sort- based indexing Naïve in- memory inversion Blocked Sort- Based Indexing Merge sort is effec)ve for

More information

Fundamentals of Database Systems

Fundamentals of Database Systems Fundamentals of Database Systems Assignment: 4 September 21, 2015 Instructions 1. This question paper contains 10 questions in 5 pages. Q1: Calculate branching factor in case for B- tree index structure,

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

CS 310: Memory Hierarchy and B-Trees

CS 310: Memory Hierarchy and B-Trees CS 310: Memory Hierarchy and B-Trees Chris Kauffman Week 14-1 Matrix Sum Given an M by N matrix X, sum its elements M rows, N columns Sum R given X, M, N sum = 0 for i=0 to M-1{ for j=0 to N-1 { sum +=

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant

More information

The Anatomy of a Large-Scale Hypertextual Web Search Engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started

More information

MG4J: Managing Gigabytes for Java. MG4J - intro 1

MG4J: Managing Gigabytes for Java. MG4J - intro 1 MG4J: Managing Gigabytes for Java MG4J - intro 1 Managing Gigabytes for Java Schedule: 1. Introduction to MG4J framework. 2. Exercitation: try to set up a search engine on a particular collection of documents.

More information

File Systems. CS170 Fall 2018

File Systems. CS170 Fall 2018 File Systems CS170 Fall 2018 Table of Content File interface review File-System Structure File-System Implementation Directory Implementation Allocation Methods of Disk Space Free-Space Management Contiguous

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

Chapter 18 Indexing Structures for Files

Chapter 18 Indexing Structures for Files Chapter 18 Indexing Structures for Files Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Disk I/O for Read/ Write Unit for Disk I/O for Read/ Write: Chapter 18 One Buffer for

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 17

More information

Chapter 1 Disk Storage, Basic File Structures, and Hashing.

Chapter 1 Disk Storage, Basic File Structures, and Hashing. Chapter 1 Disk Storage, Basic File Structures, and Hashing. Adapted from the slides of Fundamentals of Database Systems (Elmasri et al., 2003) 1 Chapter Outline Disk Storage Devices Files of Records Operations

More information

DIVIDE AND CONQUER ALGORITHMS ANALYSIS WITH RECURRENCE EQUATIONS

DIVIDE AND CONQUER ALGORITHMS ANALYSIS WITH RECURRENCE EQUATIONS CHAPTER 11 SORTING ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN C++, GOODRICH, TAMASSIA AND MOUNT (WILEY 2004) AND SLIDES FROM NANCY M. AMATO AND

More information

Information Retrieval. Lecture 10 - Web crawling

Information Retrieval. Lecture 10 - Web crawling Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the

More information

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures

More information

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes? Course work Introduc)on to Informa(on Retrieval Problem set 1 due Thursday Programming exercise 1 will be handed out today CS276: Informa)on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan

More information

Keyword Search Using General Form of Inverted Index

Keyword Search Using General Form of Inverted Index Keyword Search Using General Form of Inverted Index Mrs. Pratiksha P. Nikam Prof. Srinu Dharavath Mr. Kunal Gawande Lecturer Professor(Guide) Software developer GSMCOE, Pune,india GSMCOE, Pune,india Mumbai,

More information

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007 Information Retrieval Lecture 3 - Index compression Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Dictionary and inverted index:

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many

More information

CS-245 Database System Principles

CS-245 Database System Principles CS-245 Database System Principles Midterm Exam Summer 2001 SOLUIONS his exam is open book and notes. here are a total of 110 points. You have 110 minutes to complete it. Print your name: he Honor Code

More information

Outline of the course

Outline of the course Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library

More information

Sorting a File of Records

Sorting a File of Records Sorting a File of Records 1 Consider the problem of sorting a large file, stored on disk, containing a large number of logical records. If the file is very large at all then it will be impossible to load

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

C has been and will always remain on top for performancecritical

C has been and will always remain on top for performancecritical Check out this link: http://spectrum.ieee.org/static/interactive-the-top-programminglanguages-2016 C has been and will always remain on top for performancecritical applications: Implementing: Databases

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

Introduc)on to. CS60092: Informa0on Retrieval

Introduc)on to. CS60092: Informa0on Retrieval Introduc)on to CS60092: Informa0on Retrieval Ch. 4 Index construc)on How do we construct an index? What strategies can we use with limited main memory? Sec. 4.1 Hardware basics Many design decisions in

More information

NET/JRF-COMPUTER SCIENCE & APPLICATIONS. Time: 01 : 00 Hour Date : M.M. : 50

NET/JRF-COMPUTER SCIENCE & APPLICATIONS. Time: 01 : 00 Hour Date : M.M. : 50 1 NET/JRF-COMPUTER SCIENCE & APPLICATIONS UNIT TEST : DATA STRUCTURE Time: 01 : 00 Hour Date : 02-06-2017 M.M. : 50 INSTRUCTION: Attempt all the 25 questions. Each question carry TWO marks. 1. Consider

More information

COSEQUENTIAL PROCESSING (SORTING LARGE FILES)

COSEQUENTIAL PROCESSING (SORTING LARGE FILES) 10 COSEQUENTIAL PROCESSING (SORTING LARGE FILES) Copyright 2004, Binnur Kurt Content Cosequential Processing and Multiway Merge Sorting Large Files (External Sorting) 255 Cosequential Processing & Multiway

More information

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] cs186-

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] cs186- Midterm 1: CS186, Spring 2016 Name: Class Login: cs186- You should receive 1 double-sided answer sheet and an 11-page exam. Mark your name and login on both sides of the answer sheet, and in the blanks

More information

Memory: Page Table Structure. CSSE 332 Operating Systems Rose-Hulman Institute of Technology

Memory: Page Table Structure. CSSE 332 Operating Systems Rose-Hulman Institute of Technology Memory: Page Table Structure CSSE 332 Operating Systems Rose-Hulman Institute of Technology General address transla+on CPU virtual address data cache MMU Physical address Global memory Memory management

More information

DATA STRUCTURES AND ALGORITHMS

DATA STRUCTURES AND ALGORITHMS DATA STRUCTURES AND ALGORITHMS Sorting algorithms External sorting, Search Summary of the previous lecture Fast sorting algorithms Quick sort Heap sort Radix sort Running time of these algorithms in average

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Sorting. Bubble Sort. Pseudo Code for Bubble Sorting: Sorting is ordering a list of elements.

Sorting. Bubble Sort. Pseudo Code for Bubble Sorting: Sorting is ordering a list of elements. Sorting Sorting is ordering a list of elements. Types of sorting: There are many types of algorithms exist based on the following criteria: Based on Complexity Based on Memory usage (Internal & External

More information

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University File System Internals Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics File system implementation File descriptor table, File table

More information

Programming II (CS300)

Programming II (CS300) 1 Programming II (CS300) Chapter 12: Sorting Algorithms MOUNA KACEM mouna@cs.wisc.edu Spring 2018 Outline 2 Last week Implementation of the three tree depth-traversal algorithms Implementation of the BinarySearchTree

More information

Storage and File Structure

Storage and File Structure CSL 451 Introduction to Database Systems Storage and File Structure Department of Computer Science and Engineering Indian Institute of Technology Ropar Narayanan (CK) Chatapuram Krishnan! Summary Physical

More information

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Chapter 13 Disk Storage, Basic File Structures, and Hashing. Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files

More information

Hashing IV and Course Overview

Hashing IV and Course Overview Date: April 5-6, 2001 CSI 2131 Page: 1 Hashing IV and Course Overview Other Collision Resolution Techniques 1) Double Hashing The first hash function determines the home address If the home address is

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

CS122 Lecture 15 Winter Term,

CS122 Lecture 15 Winter Term, CS122 Lecture 15 Winter Term, 2014-2015 2 Index Op)miza)ons So far, only discussed implementing relational algebra operations to directly access heap Biles Indexes present an alternate access path for

More information