The anatomy of a large-scale small search engine: Efficient index organization and query processing
1 The anatomy of a large-scale small search engine: Efficient index organization and query processing. Simon Jonassen, Department of Computer and Information Science, Norwegian University of Science and Technology. TDT4215 Web Intelligence, NTNU, 17 March 2011. State of the Art: 1. Design a self-skipping index structure specifically for NewPForDelta compression. 2. Provide an efficient query processing method for disjunctive (OR) queries.
2 Outline: Introduction to inverted indexes; Main index components and organization; Inverted index skipping; Efficient query processing; Experimental results; Conclusions; Related work and bibliography list. Search Engine Basics
3 Inverted Index Approach to IR. Query Processing Modes. Document-At-A-Time (DAAT): a document has to be fully matched and scored for all of the query terms before any other document is considered. Term-At-A-Time (TAAT): a term's posting list has to be fully processed before any other term is considered.
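The two processing modes can be contrasted with a minimal sketch. It assumes a toy in-memory index (a dict from term to a docid-sorted list of `(docid, tf)` pairs) and uses the raw term frequency as the score; the function names `taat` and `daat` are illustrative and do not come from the slides.

```python
from collections import defaultdict

def taat(index, query):
    """Term-at-a-time: finish one posting list before touching the next."""
    accumulators = defaultdict(float)
    for term in query:
        for docid, tf in index.get(term, []):
            accumulators[docid] += tf  # partial scores live in accumulators
    return dict(accumulators)

def daat(index, query):
    """Document-at-a-time: fully score one document before the next."""
    lists = [index.get(term, []) for term in query]
    pos = [0] * len(lists)
    scores = {}
    while True:
        # The next document is the smallest docid under any list cursor.
        heads = [l[p][0] for l, p in zip(lists, pos) if p < len(l)]
        if not heads:
            break
        current = min(heads)
        total = 0.0
        for i, l in enumerate(lists):
            if pos[i] < len(l) and l[pos[i]][0] == current:
                total += l[pos[i]][1]
                pos[i] += 1
        scores[current] = total  # final score, no accumulator needed later
    return scores

toy_index = {"a": [(1, 2), (3, 1)], "b": [(2, 1), (3, 4)]}
print(taat(toy_index, ["a", "b"]))
print(daat(toy_index, ["a", "b"]))
```

Both functions produce the same scores; they differ in traversal order and in how much intermediate state (accumulators versus list cursors) has to be kept.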
4 Query Matching Modes. Conjunctive (AND) queries: a document has to match ALL of the query terms. Disjunctive (OR) queries: a document has to match ANY of the query terms (normally more time-consuming than AND queries). Statistical Similarity Scoring Models: Cosine similarity, TFxIDF, Robertson TFxIDF, Okapi BM25.
5 Statistical Similarity Scoring Models (Cosine, TFxIDF, Okapi BM25, etc.). Term frequency: number of times a term occurs in a document. Document frequency (posting list length): number of documents a term occurs in. Collection frequency: total number of occurrences of a term within the document collection. Key/query frequency: number of occurrences of a term in the query. Document length: number of tokens in the document. Global statistics: total number of documents, total number of unique terms, total number of postings (posting list entries), total number of tokens (word occurrences).
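As an illustration of how these statistics feed a scoring model, here is a sketch of a per-term Okapi BM25 contribution. The slides do not give the formula, so the Robertson-style IDF and the parameter defaults `k1=1.2` and `b=0.75` are assumptions (commonly used values), and the function name is hypothetical.

```python
import math

def bm25_term_score(tf, df, doc_len, num_docs, total_tokens, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score.

    tf           term frequency in the document
    df           document frequency (posting list length)
    doc_len      document length in tokens
    num_docs     total number of documents
    total_tokens total number of tokens in the collection
    k1, b        assumed tuning parameters, not taken from the slides
    """
    avg_doc_len = total_tokens / num_docs
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)

# A document's score is the sum of the contributions of its matching terms.
print(bm25_term_score(tf=3, df=10, doc_len=100, num_docs=1000, total_tokens=100000))
```

Note that this IDF variant turns negative for terms occurring in more than half of the documents; production systems often clamp or shift it.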
6 Main Index Components and Organization. Collection statistics: total numbers of documents, unique terms, tokens and postings. Document dictionary: for each document, the document ID, name, length and URL. Lexicon: for each term, the term ID, term string, document and collection frequencies, and a pointer to the posting list. Inverted file/posting lists: for each term, the term ID, document IDs and term frequencies. Index organization: collection statistics can be stored as a .property (or .txt) file and read at start-up. Both the document dictionary and the lexicon can be stored as two ordered sets (arrays) of constant-size records, or put into two different B-trees or similar. DocDict (28B entries): ID (int, 4B), DocNo (20B), NumTokens (int, 4B). Lexicon (40B entries): Term (20B), ID (int, 4B), DocFreq (int, 4B), TermFreq (int, 4B), EndPtr (long, 8B).
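The constant-size record layout above can be sketched with Python's `struct` module. Little-endian byte order, UTF-8 terms padded with zero bytes, and the helper names are assumptions made for this illustration; only the field widths come from the slide.

```python
import struct

# "<" = little-endian, no alignment padding (byte order is an assumption).
DOC_RECORD = struct.Struct("<i20si")      # ID, DocNo, NumTokens   -> 28 bytes
LEX_RECORD = struct.Struct("<20siiiq")    # Term, ID, DocFreq, TermFreq, EndPtr -> 40 bytes

def pack_lexicon(entries):
    """Pack (term, id, doc_freq, term_freq, end_ptr) tuples sorted by term bytes."""
    encoded = sorted((t.encode("utf-8"), i, df, cf, ptr) for t, i, df, cf, ptr in entries)
    return b"".join(LEX_RECORD.pack(*e) for e in encoded)

def lookup(lexicon_bytes, term):
    """Binary search over the fixed-size records; returns a tuple or None."""
    key = term.encode("utf-8")[:20].ljust(20, b"\0")
    lo, hi = 0, len(lexicon_bytes) // LEX_RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        if LEX_RECORD.unpack_from(lexicon_bytes, mid * LEX_RECORD.size)[0] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo * LEX_RECORD.size < len(lexicon_bytes):
        record = LEX_RECORD.unpack_from(lexicon_bytes, lo * LEX_RECORD.size)
        if record[0] == key:
            return record
    return None

lexicon = pack_lexicon([("search", 2, 40, 95, 4096), ("engine", 1, 12, 30, 1024)])
print(DOC_RECORD.size, LEX_RECORD.size)
print(lookup(lexicon, "search"))
```

Because every record has the same size, the i-th entry sits at byte offset `i * 40`, which is what makes binary search over a flat array (or a memory-mapped file) possible without any extra index.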
7 Index organization: inverted file. Basic posting list: <termid <docid freq>*>*. Alternatively, if we want to store positions as well: <termid <docid freq <pos>*>*>*. The <docid freq> postings are normally ordered by increasing docid. Simplest way to reduce space: d-gaps and delta coding, i.e. store the differences between consecutive docids rather than the docids themselves. Frequencies normally represent the number of occurrences in a document and are stored as integers. Inverted file compression: many different methods, which can be classified as parametric/non-parametric, dictionary/arithmetic, adaptive/non-adaptive, bit/byte/word-level, single-value/chunk-based, etc. We discuss a few.
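A small sketch of d-gap (delta) coding for a docid-sorted posting list follows. Treating the first gap as the distance from zero is an assumption; the function names are illustrative.

```python
from itertools import accumulate

def to_dgaps(docids):
    """Replace each docid by its distance from the previous one."""
    gaps, previous = [], 0
    for docid in docids:
        gaps.append(docid - previous)
        previous = docid
    return gaps

def from_dgaps(gaps):
    """Recover the docids as the running sum of the gaps."""
    return list(accumulate(gaps))

docids = [3, 7, 8, 15]
print(to_dgaps(docids))              # [3, 4, 1, 7]
print(from_dgaps(to_dgaps(docids)))  # [3, 7, 8, 15]
```

The gaps are smaller numbers than the docids, which is what lets the integer codes discussed next spend fewer bits per posting.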
8 Inverted file compression. Unary: store k as k-1 1's and a final 0. Elias Gamma and Delta codes. Gamma: store k1 = 1 + floor(log_2(k)) in unary, followed by k - 2^(k1-1) in binary using k1-1 bits. Delta: store k1 in Gamma and the remainder in binary. Other methods: Golomb, Rice, Interpolative codes, etc. Space efficient, but more complex and often more time-consuming: one has to decompress all previous values to get to a particular value, or store more information in order to skip. Inverted file compression: VByte. Uses 7 bits of each byte to store data and 1 bit to mark boundaries. Simplest byte-level coding; very time efficient, less space efficient.
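The unary, Gamma and VByte codes can be sketched as follows. Bit strings are used for readability rather than speed. For VByte, the byte order (low-order 7 bits first) and the meaning of the flag bit (set when more bytes follow) are assumptions, since conventions differ between implementations.

```python
def unary(k):
    """k >= 1 as k-1 ones followed by a zero."""
    return "1" * (k - 1) + "0"

def gamma_encode(k):
    """Elias gamma: length k1 in unary, then k - 2^(k1-1) in k1-1 bits."""
    if k < 1:
        raise ValueError("gamma codes are defined for k >= 1")
    k1 = k.bit_length()                      # 1 + floor(log2(k))
    nbits = k1 - 1
    remainder = k - (1 << nbits)
    binary = format(remainder, "b").zfill(nbits) if nbits else ""
    return unary(k1) + binary

def gamma_decode(bits):
    """Decode a single gamma code located at the start of a bit string."""
    k1 = bits.index("0") + 1
    nbits = k1 - 1
    remainder = int(bits[k1:k1 + nbits], 2) if nbits else 0
    return (1 << nbits) + remainder

def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)    # flag bit set: more bytes follow
            n >>= 7
        out.append(n)                        # flag bit clear: last byte
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

print(gamma_encode(5))            # 11001
print(vbyte_encode([300]).hex())  # ac02
```

Gamma spends 2*floor(log2 k)+1 bits on k, so small gaps are very cheap, while VByte never uses less than a whole byte but decodes with simple shifts and masks.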
9 Inverted file compression: Simple9 (word-level coding). Uses 4 bits of each 32-bit word to store a selector code and the remaining 28 bits to store data. According to [AM05], it is as fast as VByte and has better compression. Other methods: Simple16, Carryover methods. Inverted file compression: PFor and PForDelta. Byte-level. Stores chunks of 128 entries. Super-scalar, loop-unrolling, almost branch-free, CPU- and cache-efficient. A must-have for high-performance search engines (Lucene, LinkedIn). Main drawback: compulsory (forced) exceptions.
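A compact sketch of Simple9 follows. The nine (count, bit-width) configurations are the ones usually given for the scheme; placing the selector in the top 4 bits and packing values from the low end of the word are assumptions of this illustration, as is the greedy choice of the densest configuration that fits.

```python
# (how many values, bits per value) for selectors 0..8; count * bits <= 28.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Greedily pack as many upcoming values as possible into each 32-bit word."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in enumerate(SELECTORS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28
                for j, v in enumerate(chunk):
                    word |= v << (j * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def simple9_decode(words):
    values = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        values.extend((word >> (j * bits)) & mask for j in range(count))
    return values

data = [1] * 28 + [5, 300, 70000]
packed = simple9_encode(data)
print(len(packed), simple9_decode(packed) == data)
```

Decoding needs one table lookup per word followed by a fixed shift-and-mask loop, which is where the speed comparable to VByte comes from.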
10 Inverted file compression: NewPFor and NewPForDelta. Instead of exception offsets, each slot stores as many bits of the value as fit. The overflow bits and the exception offsets are stored as two Simple9-coded arrays. Other methods: OptPFor, PDict, 64-bit versions of these methods, etc. Basic inverted file
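The NewPFor idea can be illustrated with a simplified sketch: every value keeps its low `b` bits in a fixed-width slot, and values that do not fit contribute an offset and their remaining high bits to two side arrays. Bit-packing of the slots and Simple9 coding of the side arrays are left out, and the function names and the choice of `b` are hypothetical.

```python
def newpfor_split(chunk, b):
    """Split a chunk into b-bit slots plus exception offsets and overflow bits."""
    mask = (1 << b) - 1
    slots = [v & mask for v in chunk]                  # low b bits of every value
    offsets = [i for i, v in enumerate(chunk) if v > mask]
    overflow = [chunk[i] >> b for i in offsets]        # bits that did not fit
    return slots, offsets, overflow

def newpfor_merge(slots, offsets, overflow, b):
    """Rebuild the original values by patching the overflow bits back in."""
    values = list(slots)
    for i, high in zip(offsets, overflow):
        values[i] |= high << b
    return values

chunk = [3, 1, 70, 2, 300, 5]
slots, offsets, overflow = newpfor_split(chunk, b=3)
print(slots, offsets, overflow)
print(newpfor_merge(slots, offsets, overflow, b=3) == chunk)
```

In a full implementation the chunk would hold 128 entries, `b` would be chosen per chunk so that only a small fraction of values overflow, and the two side arrays would be stored in compressed form.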
11 Basic inverted file - buffering
12 Skipping. We may store skip pointers in order to jump over some data. Simple and efficient for AND queries. OR queries? Skip-Lists: Moffat et al. have evaluated optimal skip distances for single- and multi-level skipping. This does not take compression or buffering into account. Boldi and Vigna: skip-towers.
13 Inverted file design for efficient skipping
14 Skipping. Our inverted file iterator supports the following operations. skipto(docid): skip to the first posting whose docid is equal to or larger than the specified one. next(): go to the next element. getdocid(). getfreq().
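A minimal sketch of such an iterator over a chunked posting list is shown below. It assumes a single skip level that stores the last docid of each chunk, keeps the chunks uncompressed, and uses a tiny chunk size for the demonstration; the class and the `chunks_touched` counter are hypothetical, and the method names are snake_case versions of the operations listed above. The `intersect` helper shows why skipping is simple and effective for AND queries.

```python
import bisect

END = float("inf")  # sentinel docid returned once a list is exhausted

class SkippingIterator:
    def __init__(self, postings, chunk_size=4):
        self.chunks = [postings[i:i + chunk_size]
                       for i in range(0, len(postings), chunk_size)]
        self.last_docids = [c[-1][0] for c in self.chunks]  # one skip entry per chunk
        self.chunk_idx, self.offset = 0, 0
        self.chunks_touched = {0} if self.chunks else set()

    def _exhausted(self):
        return self.chunk_idx >= len(self.chunks)

    def get_docid(self):
        return END if self._exhausted() else self.chunks[self.chunk_idx][self.offset][0]

    def get_freq(self):
        return self.chunks[self.chunk_idx][self.offset][1]

    def next(self):
        if self._exhausted():
            return
        self.offset += 1
        if self.offset >= len(self.chunks[self.chunk_idx]):
            self.chunk_idx, self.offset = self.chunk_idx + 1, 0
            if not self._exhausted():
                self.chunks_touched.add(self.chunk_idx)

    def skip_to(self, target):
        if self.get_docid() >= target:
            return
        # Jump straight to the first chunk that can contain the target.
        j = bisect.bisect_left(self.last_docids, target, self.chunk_idx)
        if j != self.chunk_idx:
            self.chunk_idx, self.offset = j, 0
        if self._exhausted():
            return
        self.chunks_touched.add(j)
        while self.chunks[j][self.offset][0] < target:
            self.offset += 1

def intersect(a, b):
    """Conjunctive (AND) matching driven by skip_to."""
    result = []
    while a.get_docid() != END and b.get_docid() != END:
        da, db = a.get_docid(), b.get_docid()
        if da == db:
            result.append(da)
            a.next()
            b.next()
        elif da < db:
            a.skip_to(db)
        else:
            b.skip_to(da)
    return result

long_list = SkippingIterator([(d, 1) for d in range(1, 41)])
short_list = SkippingIterator([(5, 2), (33, 1), (40, 3)])
print(intersect(long_list, short_list), len(long_list.chunks_touched), len(long_list.chunks))
```

With a compressed index, every chunk that is never touched is a chunk that never has to be read or decompressed, which is the whole point of storing the skip entries.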
15 TAAT Processing
16 TAAT Processing TAAT Processing
17 TAAT Processing (figure; labels: remaining pointers, processed pointers, accumulator count increase). TAAT Processing: AND
18 DAAT Processing DAAT Processing
19 DAAT Processing: OR + Skipping. An example. DAAT Processing: OR + Skipping. Requirement Set Property: only posting lists whose accumulated maximum score is greater than or equal to the score of the current lowest-scored result can initiate candidates. Partial Ranking Property: if the current partial score plus the remaining accumulated maximum score is less than the score of the current lowest-scored result, we can discard the candidate.
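One possible reading of the two properties is sketched below as a small top-k DAAT loop. It is an illustration under stated assumptions, not the authors' implementation: postings carry precomputed scores, each list knows an upper bound `max_score`, lists are ordered by decreasing `max_score`, and the accumulated maximum score of list i is the sum of the bounds of lists i and later. All names are hypothetical.

```python
import bisect
import heapq

END = float("inf")

class ScoredList:
    def __init__(self, postings):
        self.docids = [d for d, _ in postings]
        self.scores = [s for _, s in postings]
        self.max_score = max(self.scores, default=0.0)
        self.pos = 0

    def docid(self):
        return self.docids[self.pos] if self.pos < len(self.docids) else END

    def score(self):
        return self.scores[self.pos]

    def next(self):
        self.pos += 1

    def skip_to(self, target):
        self.pos = bisect.bisect_left(self.docids, target, self.pos)

def top_k_or(lists, k):
    lists = sorted(lists, key=lambda l: l.max_score, reverse=True)
    n = len(lists)
    acc = [0.0] * (n + 1)                     # acc[i] = sum of max scores of lists i..n-1
    for i in range(n - 1, -1, -1):
        acc[i] = acc[i + 1] + lists[i].max_score
    heap, threshold, required = [], 0.0, n    # min-heap of (score, docid)
    while True:
        # Requirement set: drop lists that cannot initiate a candidate any more.
        while required > 0 and acc[required - 1] < threshold:
            required -= 1
        candidate = min((l.docid() for l in lists[:required]), default=END)
        if candidate == END:
            break
        score = 0.0
        for l in lists[:required]:
            if l.docid() == candidate:
                score += l.score()
                l.next()
        alive = True
        for i in range(required, n):
            # Partial ranking: give up if even the best case stays below the threshold.
            if score + acc[i] < threshold:
                alive = False
                break
            lists[i].skip_to(candidate)
            if lists[i].docid() == candidate:
                score += lists[i].score()
        if alive and (len(heap) < k or score > heap[0][0]):
            if len(heap) < k:
                heapq.heappush(heap, (score, candidate))
            else:
                heapq.heapreplace(heap, (score, candidate))
            if len(heap) == k:
                threshold = heap[0][0]
    return sorted(heap, reverse=True)

t1 = ScoredList([(1, 2.0), (3, 1.5), (7, 2.5)])
t2 = ScoredList([(2, 0.5), (3, 0.25), (7, 0.25), (9, 0.75)])
t3 = ScoredList([(1, 0.125), (2, 0.25), (5, 0.125), (9, 0.25)])
print(top_k_or([t1, t2, t3], k=2))
```

In this toy run, documents 5 and 9 are never scored: once the threshold rises, the two low-impact lists leave the requirement set and are only probed with `skip_to` for candidates coming from the first list.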
20 DAAT Processing: OR + Skipping. Partial Ranking Property. Requirement Set Property.
21 Experimental Results: Index. 426GB TREC GOV2 document corpus, stemmed (Snowball) and stop-word processed. 15.4 million unique terms, million documents, 4.7 billion pointers, 16.3 billion tokens. Our final index without skipping is 5.977GB. Skipping adds 87.1MB (a 1.42% increase): million posting lists with zero skip-levels, posting lists with one skip level, with two levels, and only 377 with three levels. The corresponding index built by the Terrier Search Engine (v2.1) is 8.6GB (bit-level compression, no skipping). Experimental Results: Querying. Terabyte Track 05 Efficiency Topics: first queries with number of matching terms greater than one. Platform: Intel Core 2 Quad 2.66GHz processor, 8GB RAM, 1TB 7200RPM SATA2, GNU/Linux, Java 6. Other: we use 16KB blocks for buffering.
22 Experimental Results: Querying Experimental Results: Querying
23 Experimental Results: Querying Experimental Results: Querying
24 Conclusions. We have designed an efficient skipping structure for a chunk-wise compressed index. We have designed, implemented and evaluated two efficient algorithms that apply index skipping to disjunctive queries. Both methods achieve more than 3.5 times speed-up compared to a full evaluation.
25 Related work and bibliography list. Inverted Indexes and Query Optimization
[AM05] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8, January 2005.
[LMWZ05] N. Lester, A. Moffat, W. Webber, and J. Zobel. Space-limited ranked query evaluation using adaptive pruning. In Proc. WISE, 2005.
[OAP+06] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In OSIR Workshop, SIGIR, 2006.
[STC05] T. Strohman, H. Turtle, and W. Croft. Optimization strategies for complex queries. In Proc. SIGIR. ACM, 2005.
[TF95] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Inf. Process. Manage., 31(6), 1995.
[ZM06] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.
26 Inverted File Compression
[AM10] Vo Ngoc Anh and Alistair Moffat. Index compression using 64-bit words. Softw. Pract. Exper., 40, February 2010.
[DHYS08] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In Proc. WWW. ACM, 2008.
[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1999.
[YDS09] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW. ACM, 2009.
[ZHNB06] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, page 59. IEEE Computer Society, 2006.
[ZLS08] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW. ACM, 2008.
Skipping
[BC07] S. Büttcher and C. Clarke. Index compression is good, especially for random access. In Proc. CIKM. ACM, 2007.
[BV05] P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. SPIRE. Springer-Verlag, 2005.
[CLMP08] F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In Proc. WSDM. ACM, 2008.
[MZ96] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.
27 Thank you! Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries. Simon Jonassen and Svein Erik Bratsberg. Proceedings of the 33rd European Conference on Information Retrieval (ECIR 11), Dublin, Ireland, April 2011.
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationAdministrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks
Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval 1 Outline Dictionaries Wildcard queries skip Edit distance skip Spelling correction skip Soundex 2 Inverted index Our
More informationUniversity of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier
University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier Vassilis Plachouras, Ben He, and Iadh Ounis University of Glasgow, G12 8QQ Glasgow, UK Abstract With our participation
More informationIITH at CLEF 2017: Finding Relevant Tweets for Cultural Events
IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events Sreekanth Madisetty and Maunendra Sankar Desarkar Department of CSE, IIT Hyderabad, Hyderabad, India {cs15resch11006, maunendra}@iith.ac.in
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationEfficient Search in Large Textual Collections with Redundancy
Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201, USA zjg@cis.poly.edu, suel@poly.edu ABSTRACT Current
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many
More informationIn-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for Text Retrieval Systems
In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for Text Retrieval Systems Nicholas Lester Justin Zobel Hugh E. Williams School of Computer Science and Information Technology RMIT
More informationCompressing Integers for Fast File Access
Compressing Integers for Fast File Access Hugh E. Williams Justin Zobel Benjamin Tripp COSI 175a: Data Compression October 23, 2006 Introduction Many data processing applications depend on access to integer
More informationColumn Stores versus Search Engines and Applications to Search in Social Networks
Truls A. Bjørklund Column Stores versus Search Engines and Applications to Search in Social Networks Thesis for the degree of philosophiae doctor Trondheim, June 2011 Norwegian University of Science and
More informationEfficiency vs. Effectiveness in Terabyte-Scale IR
Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system
More informationA Cost-Aware Strategy for Query Result Caching in Web Search Engines
A Cost-Aware Strategy for Query Result Caching in Web Search Engines Ismail Sengor Altingovde, Rifat Ozcan, and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,rozcan,oulusoy}@cs.bilkent.edu.tr
More informationInformation Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007
Information Retrieval Lecture 3 - Index compression Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Dictionary and inverted index:
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More information