MG4J: Managing Gigabytes for Java. MG4J - intro 1

Similar documents
MG4J: Managing Gigabytes for Java. Exercise. Adriano Fazzone DIAG Department Sapienza University of Rome. MG4J - intro 1

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

File Directories Associated with any file management system and collection of files is a file directories The directory contains information about

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Query Processing and Alternative Search Structures. Indexing common words

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

Query Answering Using Inverted Indexes

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Informatica Data Explorer Performance Tuning

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Compressed Collections for Simulated Crawling

A Security Model for Multi-User File System Search. in Multi-User Environments

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Static Pruning of Terms In Inverted Files

Compressed Collections for Simulated Crawling

COMP6237 Data Mining Searching and Ranking

Efficiency vs. Effectiveness in Terabyte-Scale IR

Caching and Buffering in HDF5

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Information Retrieval Spring Web retrieval

Full-Text Indexing For Heritrix

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007

modern database systems lecture 4 : information retrieval

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing and Searching

THE WEB SEARCH ENGINE

Preview. Memory Management

1 o Semestre 2007/2008

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Clustering Techniques A Technical Whitepaper By Lorinda Visnick

Recap: lecture 2 CS276A Information Retrieval

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Indexing and Searching

CDP Data Center Console User Guide CDP Data Center Console User Guide Version

Introduction to Information Retrieval

Using Graphics Processors for High Performance IR Query Processing

CACHE-OBLIVIOUS MAPS. Edward Kmett McGraw Hill Financial. Saturday, October 26, 13

IO-Top-k at TREC 2006: Terabyte Track

Chapter 2. Architecture of a Search Engine

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Quick-Sort. Quick-Sort 1

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan

Information Retrieval II

ECE 122. Engineering Problem Solving Using Java

Information Retrieval

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

ECE 598 Advanced Operating Systems Lecture 10

Implementation should be efficient. Provide an abstraction to the user. Abstraction should be useful. Ownership and permissions.

File Systems Ch 4. 1 CS 422 T W Bennet Mississippi College

Information Retrieval

I/O in scientific applications

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Outlook. File-System Interface Allocation-Methods Free Space Management

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

Some Practice Problems on Hardware, File Organization and Indexing

CrownPeak Playbook CrownPeak Search

The Topic Specific Search Engine

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Information Retrieval. Lecture 10 - Web crawling

DATABASE SYSTEMS. Database programming in a web environment. Database System Course, 2016

Information Retrieval May 15. Web retrieval

FUNCTIONALLY OBLIVIOUS (AND SUCCINCT) Edward Kmett

Quick-Sort. Quick-sort is a randomized sorting algorithm based on the divide-and-conquer paradigm:

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Data-Intensive Distributed Computing

Sorting Goodrich, Tamassia Sorting 1

Lecture 15. Lecture 15: Bitmap Indexes

How to Split PDF files with AutoSplit

Quick-Sort fi fi fi 7 9. Quick-Sort Goodrich, Tamassia

Advanced Database Systems

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CS47300 Web Information Search and Management

More about The Competition

Distributed computing: index building and use

Information Retrieval

CISC689/ Information Retrieval Midterm Exam

Directory Structure and File Allocation Methods

Quick Sort. CSE Data Structures May 15, 2002

Document Representation : Quiz

arxiv: v2 [cs.ds] 30 Sep 2016

SSIM Collection & Archiving Infrastructure Scaling & Performance Tuning Guide

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

C13: Files and Directories: System s Perspective

Data Stream Processing

Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Data Structure and Algorithm Homework #3 Due: 2:20pm, Tuesday, April 9, 2013 TA === Homework submission instructions ===

Transcription:

MG4J: Managing Gigabytes for Java MG4J - intro 1

Managing Gigabytes for Java Schedule: 1. Introduction to MG4J framework. 2. Exercitation: try to set up a search engine on a particular collection of documents. MG4J - intro 2

MG4J: introduction (1) Free full-text search engine for large document collections Written in Java Developed by the Department of Computer Science (University of Milan) MG4J home: http://mg4j.di.unimi.it/ Documentation: http://mg4j.di.unimi.it/docs-big/ Manual: http://mg4j.di.unimi.it/man-big/manual.pdf MG4J - intro 3

MG4J: introduction (2) MG4J can be used for indexing and querying a large collection of documents INPUT: set of documents with the same number and type of fields The fields can be textual or virtual (example of virtual fields are the anchors of HTML pages) OUTPUT: essentially an inverted index MG4J - intro 4

MG4J: To create and to query an inverted index with MG4j we have to complete these 3 steps: 1. Create a collection of documents. 2. Create an inverted index on the created collection. 3. Query the created inverted index. MG4J - intro 5

Document collection: Document factory The document factory module is responsible for turning raw byte sequences into documents. In particular, the factory transforms a sequence of bytes into a number of fields Every field has a name and a type Textual fields are words that appear in the title and in the text of a document The words are processed by a term processor The dictionary is the set of terms that are indexed MG4J - intro 6

Example of index (1) MG4J - intro 7

Example of index (2) An occurrence is a group of three numbers (t,d,p) (t,d,p) means that the term whose index is t appears in document d at position p Inverted lists can be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively MG4J - intro 8

Building batches MG4J scans the whole document collection producing batches Batches are sub-indices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little Indexer operations: 1. scan all documents and extract the occurrences 2. gather new terms, sort them in alphabetical ordering, and renumber all the occurrences 3. (if required) renumber and sort the documents in increasing order 4. sort (at least partially) the occurrences in increasing term order 5. create the subindex of the current batch of occurrences MG4J - intro 9

Batch combination Once the batches are created, they are combined in a single index MG4J has three type of index combination: Concatenation Merging Pasting MG4J - intro 10

Batch combination: concatenation The first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of number of documents of the first and second index, and so on The resulting index is identical to the index that would be produced by indexing the concatenation of document sequences producing each index MG4J - intro 11

Batch combination: merging Assuming that each index contains a separate subset of documents, with non-overlapping number, we can merge the lists accordingly In case a document appears in two indexes, the merge operation is stopped! Note that no renumbering is performed This is the kind of combination that is applied to batches when documents have been renumbered, and each batch contains potentially non-consecutive document numbers MG4J - intro 12

Batch combination: pasting Each index is assumed to index a (possibly empty) part of a document For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered) If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by applying the UNIX command paste to the text files This is the kind of combination that is applied to virtual documents MG4J - intro 13

Time and space requirements Scanning phase is time and space consuming MG4J is able to work with little memory, but more memory allows to create larger batches which can be merged quickly J The indexer produces a number of subindexes which is proportional to the number of occurrences But combining subindexes is a resource consuming activity. Good strategy: Try to create batches as large as possible and check the logs to avoid to work with an almost full heap (this slows down Java) MG4J - intro 14

Querying the index Once the index is built we can query it using a web server MG4J allows to use command line and the web browser Flexibility to improve query answer time: a portion of the index can loaded into main memory (caching) MG4J - intro 15

Sophisticated query: scorer MG4J provides very sophisticated query tuning To use this features, we must use the command line interface. All settings are then used for subsequent queries (sticky) MG4J allows to choose the scorer for the ranking of the results of a query Scorers assign a score to the document, depending on some criterion (e.g., the frequency of the term in the document) Documents are ordered using the scores, so that more relevant documents appears at the beginning of the result list MG4J - intro 16

Virtual fields A virtual field produces pieces of text that refer to other documents (possibly belonging to the collection) Referrer: the document that is referring to another document Referee: the document to which a piece of text of the Referrer is referring to Intuitively, the Referrer gives us information about the Referee The Referrer produces in a virtual field a number of fragments of text, each referring to a Referee The content of a virtual field is a list of pairs made by the piece of text (called virtual fragment) and by some string that is aimed at representing the Referee (called the document spec) MG4J - intro 17

Virtual fields: HTMLDocumentFactory In the case of the HTML document: the document spec is a URL (as specified in the href attribute) the virtual fragment is the content of the anchor element and some surrounding text (anchor context) The HTMLDocumentFactory produces the pairs (document spec, virtual fragment) MG4J - intro 18

Virtual fields: document resolver The document factory just produces fields out of a HTML document There is no fixed way to map document spec into actual references to documents in the collection. This is resolved by the notion of document resolver The document resolver maps the document spec, which are produced by some document factory, into actual references to the documents in the collection MG4J - intro 19

Virtual gap All the virtual fragments that refer to a given document of the collection are like a single text, called the virtual text Virtual fragments coming from different anchors are concatenated, and this may produce false positive results To avoid such kinds of false positives, we can use virtual gap The virtual gap is a positive integer, representing the virtual space left between different virtual fragments MG4J - intro 20

Payload-based index For metadata (dates, integers, etc.) MG4J provides a special kind of index called payload-based index Payload-based index leverages the structure of the textbased index, with the difference that each posting has a payload Searching a payload-based index is different from searching a text-based index. For example, instead of keyword queries and Boolean queries, we can use range queries [..] Assuming to have the field date, we can use the following query: [ 20/2/2007.. 23/2/2007 ] to get all the documents whose date is between Feb. 20 and Feb. 23, 2007 MG4J - intro 21

Performance (1) MG4J provides a great flexibility for the index construction, and all the choices have a significant impact on performance If we have enough memory, building large batches is a good trick We can choose the codes used for compression: Nonparametric codes are quicker than parametric codes. The default choice is γ-code for frequencies and counts, while δ-code for pointers and positions. Max speed can be obtained by using γ-code for everything MG4J - intro 22

Performance (2) Another trick is discarding what is not necessary: pointers, counts, positions, etc. For example, if we use BM25 or TF/IDF scoring, we do not need to store positions in the index (use -c POSITIONS:none) Indexes contain skipping structures to skip index entries. However, skipping structures introduce a slight overhead when scanning sequentially a list (use --noskips option to disable skipping structure) MG4J - intro 23

Performance (3) The index can be: read from disk memory mapped directly loaded into main memory The three solutions work with increasing speed and increased main memory usage The default is to read the index from disk We can add suitable options to the index URI (e.g. mapped=1 or inmemory=1) to use the other solutions MG4J - intro 24

Partitioning Partitioning an index means dividing an index using some criterion Partitioning can be: Documental (split using documents) Lexical (split using terms) Personalized Improving performance: we can partition our index, and once we have several subindexes, we can decide which ones will be loaded into main memory, which ones will be mapped, etc. MG4J - intro 25

Your turn ;) TECHNICAL REQUIREMENTS: UNIX-like Operating System and Java (>=6) Document collection and the libraries are available at: http://www.diag.uniroma1.it/~fazzone Download and extract htmldis.tar.gz (html pages of the DIAG) Download and extract webir_lab_lib.zip (yes, another time K ) Edit the file set-my-classpath.sh: replace your_directory with the path of the folder containing all the.jar files (lib folder) Set the CLASSPATH: source set-my-classpath.sh MG4J - intro 26

Your turn ;) Follow the instructions inside mg4j-exercise.pdf to create an Inverted Index and to perform some queries (try the different scorers). If you have problems, read the MG4J manual: http://mg4j.di.unimi.it/man-big/manual.pdf If you want, you can create your own document collection (ask to me how to develop a simple crawler), build the inverted index, then submit some queries and try the different scorers. MG4J - intro 27