Information Retrieval. Lecture 2 - Building an index


1 Information Retrieval Lecture 2 - Building an index Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester / 40

2 Overview Introduction Introduction Boolean model revisited Building the vocabulary Building an index 2/ 40

3 Introduction Introduction Last time: terminology (inverted indexes, dictionary, posting lists, query, etc). conceptual model for IR (information need, query, retrieval of relevant documents) the boolean model (building an index and applying queries) Today: the boolean model revisited (processing phrase queries) how to build a dictionary (i.e. extracting the keywords from data, and the corresponding postings)? 3/ 40

4 Boolean model revisited Part 1. Boolean model (continued) 4/ 40

5 Boolean model revisited Recall - the boolean model: an inverted index associates keywords with posting lists; the posting lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.); boolean queries are processed by merging posting lists in order to find the documents satisfying the query; the cost of this list merging is linear in the total number of document ids: O(m + n), where m and n are the lengths of the two lists; question: how to process phrase queries (i.e. taking the word's context into account)? 5/ 40
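The linear-time merge recalled above can be sketched as follows. This is a minimal illustration, assuming each posting list is a sorted Python list of integer document ids; the function name `intersect` and the sample lists are made up for the example.

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists in O(m + n) comparisons."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both keywords.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document id
        else:
            j += 1
    return answer


# Hypothetical posting lists for two keywords.
result = intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101])
print(result)  # [2, 31]
```

Each step advances at least one of the two cursors, which is where the O(m + n) bound comes from.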

6 Boolean model revisited Processing phrase queries (1 / 3) queries where the context of the keyword matters, examples: Seminar für Sprachwissenschaft, graph theory, artificial intelligence, ... the user wants documents where the whole phrase appears, and not only some parts of it (i.e. ich studiere Sprachwissenschaft is not a match); about 10 % of web queries are phrase queries (e.g. song names, institutions, etc.); such queries need either more complex dictionary terms, or more complex postings (critical parameter: size of the index). 6/ 40

7 Boolean model revisited Processing phrase queries (2 / 3) A) convert phrase queries into conjunctions of biwords: use key-phrases of length 2, example: Phrase: Gottlieb Daimler Stadion; Dictionary: (a) Gottlieb Daimler, (b) Daimler Stadion. The dictionary is made of biwords (notion of context). NB: the dictionary gets bigger than with single keywords, and the conjunction may give false positives (every biword occurs in the document, but the words do not occur together as one phrase) 7/ 40
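A small sketch of the biword approach, under the assumption that documents are whitespace-tokenized and lowercased. The two sample documents and all function names are invented for illustration; document 2 is built to show a false positive.

```python
from collections import defaultdict


def biwords(tokens):
    """Return the list of adjacent word pairs of a token sequence."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    terms = biwords(phrase.lower().split())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection: document 2 contains both biwords, but not the phrase.
docs = {
    1: "the Gottlieb Daimler Stadion in Stuttgart",
    2: "Gottlieb Daimler lived long before the Daimler Stadion opened",
}
index = build_biword_index(docs)
candidates = phrase_candidates(index, "Gottlieb Daimler Stadion")
print(candidates)  # {1, 2}: document 2 is a false positive
```

Removing such false positives requires a check against the document text or a positional index.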

8 Boolean model revisited Processing phrase queries (3 / 3) B) store positions in the inverted indexes, example: termid ::= doc1: position1, position2, ...; doc2: position1, position2, ... Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists). NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words) 8/ 40

9 Example Boolean model revisited Which documents can contain the sentence "to be or not to be", considering the following (incomplete) indexes?
be ::= 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367
to ::= 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191
9/ 40
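The example can be checked mechanically. The sketch below assumes the positional index is stored as a nested dictionary (term -> document id -> list of positions) and encodes the phrase as (word, offset) constraints. Only "to" and "be" are constrained, because the given indexes are incomplete and contain no entries for "or" and "not". The function name `match_offsets` is an assumption made for this illustration.

```python
# Positional index from the example (term -> doc id -> positions).
index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
}

# "to be or not to be": to at p, be at p+1, to at p+4, be at p+5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]


def match_offsets(index, pattern):
    """Return {doc id: [start positions]} satisfying all offset constraints."""
    # Only documents containing every constrained word can match.
    docs = set.intersection(*(set(index[word]) for word, _ in pattern))
    first_word, first_offset = pattern[0]
    hits = {}
    for doc in sorted(docs):
        starts = []
        for position in index[first_word][doc]:
            start = position - first_offset
            if all(start + offset in index[word][doc] for word, offset in pattern):
                starts.append(start)
        if starts:
            hits[doc] = starts
    return hits


hits = match_offsets(index, pattern)
print(hits)  # {4: [429]}
```

Documents 2 and 4 contain both words, but only document 4 has them at compatible positions (429, 430, 433, 434). Whether "or" and "not" sit at positions 431 and 432 cannot be decided from the incomplete indexes.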

10 Boolean model revisited Size of positional indexes: positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document id); the size of such indexes therefore grows with the size of the documents (one entry per token, not one per document); the size of a positional index depends on the language being indexed and the type of document (books, articles, etc.); on average, a positional index is 2-4 times bigger than a non-positional index, and it can reach 35 to 50 % of the size of the original text (for English); positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al, 2005]). 10/ 40

11 Building the vocabulary Part 2. Building the vocabulary 11/ 40

12 Building the vocabulary 1. Inter-document parsing 2. Intra-document parsing Tokenization Keyword selection Normalization Stemming / Lemmatization 12/ 40

13 Building the vocabulary Parsing the data (1 / 3) 1. Interdocument parsing: processing the raw data to produce a usable collection of documents. By usable, we mean only containing a sequence of characters. Problems: document's format (pdf, doc, html, etc.), document's encoding (ISO , UTF-8), document's language (automatic recognition), document unit (example: mbox vs Maildir) 13/ 40

14 Building the vocabulary Parsing the data (2 / 3) 2. Intradocument parsing: processing a stream of characters to extract keywords. 1st task: tokenization, main difficulties: token delimiters (ex: Chinese), apostrophes (ex: O'Neill, Finland's capital), hyphens (ex: Hewlett-Packard, state-of-the-art), segmented compound nouns (ex: Los Angeles), unsegmented compound nouns (ex: Sonderforschungsbereich), numerical data (dates, IP addresses), word order (ex: Arabic wrt nouns and numbers) 14/ 40

15 Building the vocabulary Parsing the data (3 / 3) Solutions for tokenization issues: (a) using a pre-defined dictionary with largest matches and heuristics for unknown words (b) using learning algorithms trained over hand-segmented words 15/ 40
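Solution (a) can be sketched as a greedy "largest match" segmenter. This is only one possible reading of the slide: the tiny dictionary is hypothetical, and the fallback heuristic for unknown material (emit a single character) is an assumption chosen to keep the example short.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    longest = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fallback heuristic (assumption): an unknown single character
            # becomes a token of its own.
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary of known word parts.
dictionary = {"sonder", "forschung", "forschungs", "bereich"}
print(max_match("sonderforschungsbereich", dictionary))
# ['sonder', 'forschungs', 'bereich']
```

Greedy matching can pick a wrong split when a long dictionary entry overlaps the correct boundary, which is one reason trained segmenters (solution b) are used as well.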

16 Building the vocabulary Choosing keywords: selecting the words that are most likely to appear in a query. These words characterize the documents they appear in. What about the other words (i.e. noise words)? Words that do not carry informative content wrt queries: high frequency terms (the, a, is, etc.), words with little semantic content such as function words (if, of, to, etc.). Question: can these words be determined automatically to compute a stop list (i.e. words that will be discarded during indexing)? 16/ 40

20 Stop list Building the vocabulary How to build a stop list? Compute the most frequent terms (either as a preprocessing step or from other corpora with an equivalent distribution)? Be careful, some words like home, life or water are among the 200 most frequently used terms in English [Fox, 1992]. Sort these words by frequency, and apply a semantic filter according to the domain of discourse. NB: some common terms are needed for phrase queries (song names, relation queries, etc.) 17/ 40
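The procedure above (count, sort by frequency, filter) might look like the following sketch. The `keep` set stands in for the semantic filter, which in practice is a manual, domain-dependent decision; the function name and the toy text are assumptions made for illustration.

```python
from collections import Counter


def build_stop_list(documents, top_k, keep=frozenset()):
    """Return the top_k most frequent terms, minus the terms listed in keep."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    ranked = [term for term, _ in counts.most_common(top_k)]
    # Semantic filter (assumption): frequent but meaningful terms stay indexed.
    return [term for term in ranked if term not in keep]


# Toy collection with made-up frequencies.
documents = ["the the the the is is is water water home"]
stop_list = build_stop_list(documents, top_k=3, keep={"water", "home", "life"})
print(stop_list)  # ['the', 'is']
```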

21 Building the vocabulary Normalization of tokens: retrieval needs normalized data, examples: acronyms (USA vs U.S.A.), dates (22/10/2007 vs 10/22/2007 vs 2007/10/22), diacritics (Tübingen vs Tuebingen), abbreviations (ca. vs circa), typography (university vs University), etc. Idea: using equivalence classes of terms, ex: { Opel, OPEL, opel } -> opel. Alternative: expanding queries to all members of a class (efficiency issues). NB: documents and queries have to be processed using the same tokenization process! 18/ 40
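A minimal sketch of equivalence classing, assuming three simple rules: case folding, transliteration of German umlauts, and removal of periods. The rule set is illustrative only; it covers the acronym, diacritics and typography examples but not dates or abbreviations such as ca. vs circa.

```python
# Hypothetical transliteration table for German characters.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize(token):
    """Map a token to the representative of its equivalence class."""
    token = token.lower()              # typography: University -> university
    token = token.translate(UMLAUTS)   # diacritics: tübingen -> tuebingen
    token = token.replace(".", "")     # acronyms: u.s.a. -> usa
    return token


print(normalize("U.S.A."))    # usa
print(normalize("Tübingen"))  # tuebingen
print({normalize(t) for t in ["Opel", "OPEL", "opel"]})  # {'opel'}
```

The same function has to be applied to documents at indexing time and to queries at search time, as the NB on the slide requires.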

22 Building the vocabulary Stemming and lemmatization: role: reducing inflectional forms to common base forms, example: car, cars, car's, cars' -> car; am, are, is -> be. Stemming removes suffixes (surface markers) to produce root forms; lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser). Illustration of the difficulty: plurals (woman/women, crisis/crisi?), derivational morphology (automatize/automate). English: Porter stemming algorithm (University of Cambridge, UK, 1980) 19/ 40

23 Building the vocabulary Porter stemmer (1 / 3) algorithm based on a set of context-sensitive rewriting rules martin/porterstemmer/index.html martin/porterstemmer/def.txt rules are composed of a pattern (left-hand-side) and a string (right-hand-side), example: (.*)sses -> \1ss; (.*[aeiou].*)ed -> \1; (.*[aeiou].*)y -> \1i. Rules may be constrained by conditions on the word's measure, example: (m > 1) (.*)ement -> \1 20/ 40
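The three unconditioned example rules can be tried out directly as regular expressions. This sketch applies the first rule whose pattern matches the whole word; it is a toy covering only the rules quoted on the slide, not an implementation of the Porter stemmer, and it treats only a, e, i, o, u as vowels.

```python
import re

# (pattern, replacement) pairs taken from the slide's examples.
RULES = [
    (r"(.*)sses", r"\1ss"),
    (r"(.*[aeiou].*)ed", r"\1"),
    (r"(.*[aeiou].*)y", r"\1i"),
]


def apply_rule(word):
    """Rewrite word with the first rule whose pattern matches it entirely."""
    for pattern, replacement in RULES:
        match = re.fullmatch(pattern, word)
        if match:
            return match.expand(replacement)
    return word  # no rule applies


for word in ["caresses", "plastered", "happy", "sky", "bled"]:
    print(word, "->", apply_rule(word))
```

The vowel requirement in the second and third patterns is what keeps words such as sky and bled unchanged.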

24 Building the vocabulary Porter stemmer (2 / 3) provided a list of consonants is denoted by C, and a list of vowels by V, any word, or part of a word, has one of the four forms: CVCV...C, CVCV...V, VCVC...C, VCVC...V. These may all be represented by the single form (m is the measure): [C](VC){m}[V]. Examples: m=0: TR, EE, TREE, Y, BY. m=1: TROUBLE, OATS, TREES, IVY. m=2: TROUBLES, PRIVATE, OATEN, ORRERY. 21/ 40
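The measure can be computed by mapping each letter to C or V and counting the VC transitions. The sketch below assumes Porter's convention that y counts as a vowel when it follows a consonant. The helper `strip_ement` illustrates the conditioned rule (m > 1) (.*)ement -> \1, with the condition tested on the remaining stem; both function names are chosen for this example.

```python
def is_consonant(word, i):
    """True if the letter at index i acts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m in the form [C](VC){m}[V]."""
    word = word.lower()
    form = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    # Each vowel run followed by a consonant run contributes one VC pair.
    return sum(1 for a, b in zip(form, form[1:]) if a == "V" and b == "C")


def strip_ement(word):
    """Apply (m > 1) (.*)ement -> \\1, testing m on the remaining stem."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


for w in ["TR", "TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "ORRERY"]:
    print(w, measure(w))
```

With this definition, replacement is reduced to replac (the stem has m = 2), while cement is left unchanged (the stem c has m = 0).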

25 Building the vocabulary Porter stemmer (3 / 3) sequential application of these reduction rules within 5 phases (total of 60 rules): Step 1 deals with plurals and past participles; Steps 2 to 5 deal with English-specific suffixes. Within a phase, in case of ambiguity the longest suffix match is preferred; once the rule is chosen, its success or failure has no importance wrt the selection of another rule. Suffix stripping of a vocabulary of 10,000 words: Number of words reduced in step 1: 3597, step 2: 766, step 3: 327, step 4: 2424, step 5: 1373. Number of words not reduced: / 40

26 Building the vocabulary About existing stemmers (1 / 2) Julie Beth Lovins' stemmer: stemming/general/lovins.htm developed at MIT (US) in 1968; single-pass algorithm, context sensitive, iterative longest-match, 294 ending patterns, uses a recoding phase to deal with spelling ambiguity 23/ 40

27 Building the vocabulary About existing stemmers (2 / 2) Paice/Husk stemmer: stemming/links/paice.htm developed at Lancaster University (UK) in 1990; single table of rules, each specifying the removal or replacement of an ending; indefinite number of stages; rules are indexed by the last letter of the ending; aggressive suffix removal 24/ 40

28 Building the vocabulary About stemming Dangers of stemming: information retrieval vs. information on Golden Retrievers; gravity vs. gravitation (examples from resp. Amit Singhal and Richard Belew). Usefulness of stemming? Compressed index, vocabulary reduced by 10 to 50 % (fast processing) vs. inconvenience due to poor retrieval (loss of meaning brought by the context). Benefits of stemming depend on the language (it suits languages with rich word inflection better) 25/ 40

29 Building an index Part 3. Building an index 26/ 40

30 Building an index 1. Block-merge indexing 2. Single-pass indexing 3. Distributed indexing 4. Dynamic indexing 27/ 40

31 Building an index Efficiency considerations (1 / 2) Indexing algorithms depend on hardware characteristics. Accessing data in memory is faster than on disk. Moving the disk head to a non-contiguous area is time-consuming. Operating systems read and write blocks of bytes. Reading a single byte needs the same time as reading the whole block (common block size: 2^n KB). Some figures (2007 standards): a disk seek takes s; a block transfer from disk takes 10^-7 s per byte; a processor operation takes 10^-7 s; a processor's clock cycle is 10^-9 s (avg. 1 GHz) 28/ 40

32 Building an index Efficiency considerations (2 / 2) To give an idea: example of the REUTERS collection (Aug. 96 - Aug. 97):
statistic                                          value
documents                                          800,000
avg. # word tokens per document                    200
word types                                         400,000
avg. # bytes per token (incl. spaces/punct.)       6
avg. # bytes per token (without spaces/punct.)     4.5
avg. # bytes per word type                         7.5
non-positional postings                            100,000,000
each posting entry is encoded using 12 bytes (4+4+4: term, doc, freq) 29/ 40

33 Recall: indexing Building an index An index can be built by assembling all postings (term, docid) via a first pass through the collection. Then the postings are sorted according to their term and docid. Finally, postings belonging to a given term are compacted into a posting list, and statistics about terms are computed. NB: for efficiency reasons, a term can be represented by a termid 30/ 40
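The three steps recalled above (collect, sort, compact) fit in a few lines when everything is held in memory. The sketch assumes whitespace tokenization and uses the terms themselves rather than termids; the sample collection is made up.

```python
from itertools import groupby


def build_index(docs):
    """Build term -> sorted list of doc ids from a {doc id: text} collection."""
    # First pass: assemble all (term, docid) postings.
    postings = [(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.lower().split()]
    # Sort by term, then docid; duplicates are dropped on the way.
    postings = sorted(set(postings))
    # Compact the postings of each term into one posting list.
    index = {}
    for term, group in groupby(postings, key=lambda posting: posting[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index


docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_index(docs)
print(index["home"])  # [1, 2]
# A simple per-term statistic: the document frequency.
doc_freq = {term: len(plist) for term, plist in index.items()}
```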

34 Building an index Block-merge indexing (1 / 3) The list of posting entries may be huge and not fit into memory. To avoid this, one uses an external sorting algorithm. Note that disk seeks have to be minimized. Principles of the block merge algorithm: (a) split the collection into parts of equal size, (b) sort the postings (termid, docid) corresponding to a part of the collection in memory, (c) store the intermediate result on disk, and (d) merge all intermediate results to produce the final index 31/ 40

35 Building an index Block-merge indexing (2 / 3) Block merge algorithm (from [Manning et al,07]):
    blockmerge(collection c)
      n <- 0
      while (c != [])
        n <- n+1
        block <- parsenextblock(c)
        invert(block)
        writetodisc(block, fn)
      endwhile
      return merge([f1.. fn])
NB: merging needs to know the term-termid mapping. 32/ 40
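A runnable sketch of the block-merge idea, assuming each block is already given as a list of (termid, docid) pairs and that sorted blocks are stored as tab-separated text files. The file layout and function names are choices made for this illustration; the final index is kept in memory here, whereas a real system would stream it back to disk.

```python
import heapq
import os
import tempfile


def write_block(pairs, directory, n):
    """Sort one block in memory and store it as file number n."""
    path = os.path.join(directory, f"block{n}.txt")
    with open(path, "w") as out:
        for term_id, doc_id in sorted(set(pairs)):
            out.write(f"{term_id}\t{doc_id}\n")
    return path


def read_block(path):
    """Stream the (termid, docid) pairs of a sorted block file."""
    with open(path) as block:
        for line in block:
            term_id, doc_id = line.split("\t")
            yield int(term_id), int(doc_id)


def block_merge(blocks, directory):
    """Write every block to disk, then merge all block files sequentially."""
    paths = [write_block(b, directory, n) for n, b in enumerate(blocks, start=1)]
    index = {}
    # heapq.merge reads the sorted files in parallel, one pair at a time.
    for term_id, doc_id in heapq.merge(*(read_block(p) for p in paths)):
        plist = index.setdefault(term_id, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    return index


with tempfile.TemporaryDirectory() as tmp:
    blocks = [[(2, 1), (1, 1), (2, 2)], [(1, 3), (3, 3), (2, 3)]]
    merged = block_merge(blocks, tmp)
print(merged)  # {1: [1, 3], 2: [1, 2, 3], 3: [3]}
```

Each block file is read once from start to end, which keeps the number of disk seeks low.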

36 Building an index Block-merge indexing (3 / 3) Disk transfer time during merge with the REUTERS sample ( postings): 64 blocks × ( entries) × 12 bytes × b × 2 transfers ≈ 4.1 minutes, where b = 10^-7 seconds/byte is the time needed to transfer one byte. 33/ 40
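The order of magnitude of this estimate can be reproduced from the collection statistics. The sketch assumes the rounded figure of 100,000,000 postings from the REUTERS table rather than the exact block sizes used on the slide, so it gives 4.0 minutes instead of 4.1.

```python
def merge_transfer_seconds(postings, bytes_per_posting=12, b=1e-7, transfers=2):
    """Time to move all postings across the disk interface during the merge."""
    return postings * bytes_per_posting * b * transfers


seconds = merge_transfer_seconds(100_000_000)
print(f"{seconds:.0f} s = {seconds / 60:.1f} minutes")  # 240 s = 4.0 minutes
```

The factor 2 accounts for every posting being transferred twice, once when the blocks are read and once when the merged index is written. Disk seeks are not included in this figure.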

37 Building an index Single-pass indexing (1 / 2) Limit of the block merge method: the mapping term-termid may not fit into memory. Idea: create a dictionary for each block, and store this dictionary on disk (NB: the dictionary contains the terms themselves). A final step merges the dictionaries and postings. Principles of the single-pass indexing algorithm: (a) split the collection, (b) create a dictionary for the current split, (c) for each token, if it belongs to the current dictionary, retrieve its postings, if not, create a new empty posting list for it, and add a new entry to the dictionary, (d) add the current token to the posting list, (e) sort the terms, and (f) write the block and dictionary to disk 34/ 40

38 Building an index Single-pass indexing (2 / 2) Single-pass indexing algorithm (from [Manning et al,07]):
    invert(stream c)
      output <- new File()
      dictionary <- new Hash()
      while (size < sizemax) do
        tok <- next(c)
        if not(term(tok) in dictionary)
          then p_list <- addtodict(dictionary, term(tok))
          else p_list <- getfromdict(dictionary, term(tok))
        if (full(p_list))
          then p_list <- doublesize(p_list)
        addtolist(p_list, tok)
      endwhile
      sorted <- sortterms(dictionary)
      writetodisc(sorted, dictionary, output)
      return output
35/ 40
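The inversion of one block could be written as follows. The sketch assumes the token stream yields (term, docid) pairs and uses a posting count as a stand-in for the memory limit; the block file format (one term per line, followed by its document ids) is an assumption made for illustration. Python lists grow automatically, so the explicit doubling step of the pseudocode is not needed.

```python
import os
import tempfile


def spimi_invert(token_stream, max_postings, path):
    """Invert one block of the stream and write it, sorted by term, to path."""
    dictionary = {}  # term -> posting list (terms are stored directly, no termid)
    size = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # a new term gets an empty list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            size += 1
        if size >= max_postings:
            break  # block is full; the caller calls again for the next block
    with open(path, "w", encoding="utf-8") as out:
        for term in sorted(dictionary):
            doc_ids = ",".join(map(str, dictionary[term]))
            out.write(f"{term}\t{doc_ids}\n")
    return path


# Hypothetical token stream for two small documents.
tokens = iter([("to", 1), ("be", 1), ("or", 1), ("not", 1),
               ("to", 1), ("be", 1), ("be", 2), ("to", 2)])
with tempfile.TemporaryDirectory() as tmp:
    block_path = spimi_invert(tokens, max_postings=100,
                              path=os.path.join(tmp, "block1.txt"))
    with open(block_path, encoding="utf-8") as block:
        block_lines = block.read().splitlines()
print(block_lines)  # ['be\t1,2', 'not\t1', 'or\t1', 'to\t1,2']
```

Because each block file is sorted by term, the final merge step can read all block files sequentially.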

39 Building an index Distributed indexing (1 / 4) Huge collections: the index cannot be computed on a single machine, so it is partitioned across several machines. We can choose to partition either the postings or the keywords (document-partition versus term-partition). In all cases, the mapping term-termid must be consistent (e.g. precomputed). General idea: using a cluster of machines, each representing a node that performs a sub-task. A node may crash; in that case the task it has been given is reallocated to another node. A master node manages these task allocations (e.g. robustness, synchronization) 36/ 40

40 Building an index Distributed indexing (2 / 4) General architecture for indexing: MapReduce. The collection is split (the size of each fragment is chosen for efficiency, as a trade-off between flexibility and read/write access time). The Map phase applies to each split a parser, which computes the postings contained in the split. The output of a parser is stored in local intermediate files called segment files. Each segment file contains the postings for some terms (cf. term-partition). The Reduce phase assigns an inverter to each term-partition; this inverter collects the corresponding postings from the segment files 37/ 40
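The data flow described above can be simulated sequentially in one process. The sketch is only a model of the roles (parser, segment files, inverter): the two term partitions, the sample splits and all function names are assumptions, terms are assumed to start with a lowercase ASCII letter, and there is no real cluster, master node or fault handling.

```python
from collections import defaultdict

# Hypothetical term partitions, defined by the first letter of the term.
PARTITIONS = {"a-m": ("a", "m"), "n-z": ("n", "z")}


def partition_of(term):
    """Return the name of the term partition that a term belongs to."""
    for name, (low, high) in PARTITIONS.items():
        if low <= term[0] <= high:
            return name
    raise ValueError(f"no partition for term {term!r}")


def map_parse(split):
    """Parser (Map): emit (term, docid) pairs into one segment per partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def reduce_invert(segment_files):
    """Inverter (Reduce): build the posting lists of one term partition."""
    pairs = {pair for segment in segment_files for pair in segment}
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)


# Two splits of a toy collection, each handled by one parser.
splits = [[(1, "index the collection"), (2, "parse the split")],
          [(3, "merge the index")]]
segment_files = [map_parse(split) for split in splits]
# Each inverter collects its partition's segment from every parser.
index = {name: reduce_invert([seg[name] for seg in segment_files])
         for name in PARTITIONS}
print(index["a-m"])  # {'collection': [1], 'index': [1, 3], 'merge': [3]}
```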

41 Building an index Distributed indexing (3 / 4) J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters (2004) mapreduce-osdi04.pdf 38/ 40

42 Building an index Distributed indexing (4 / 4) Remarks: The number of partitions is a parameter of the indexing system. The posting list associated with a single term is supposed to fit in a single machine's memory. Parsers and inverters are not separate machines: according to availability, the master node assigns a machine either a parsing or an inverting task. For efficiency reasons, the network traffic is reduced as much as possible 39/ 40

43 Building an index To be continued... 40/ 40


More information

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Introduction to Information Retrieval and Boolean model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Unstructured (text) vs. structured (database) data in late

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Hamid Rastegari Lecture 4: Index Construction Plan Last lecture: Dictionary data structures

More information

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS4611: Information Retrieval Professor M. P. Schellekens Assistant: Ang Gao Slides adapted from P. Nayak and P. Raghavan Information Retrieval Lecture 2: The term

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa

More information

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4)

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4) CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for Natural

More information

Information Retrieval. Lecture 10 - Web crawling

Information Retrieval. Lecture 10 - Web crawling Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the

More information

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Index Construction Overview Introduction Hardware

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 4: Index construction Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Outline of the course

Outline of the course Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library

More information

CS276 Information Retrieval and Web Search. Lecture 2: Dictionary and Postings

CS276 Information Retrieval and Web Search. Lecture 2: Dictionary and Postings CS276 Information Retrieval and Web Search Lecture 2: Dictionary and Postings Recap of the previous lecture Basic inverted indexes: Structure: Dictionary and Postings Key step in construction: Sorting

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Indexing April 27, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Recap: Inverted Indexes

More information

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries

More information

CS105 Introduction to Information Retrieval

CS105 Introduction to Information Retrieval CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft) IN4325 Indexing and query processing Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for

More information

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes? Course work Introduc)on to Informa(on Retrieval Problem set 1 due Thursday Programming exercise 1 will be handed out today CS276: Informa)on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

CS60092: Informa0on Retrieval

CS60092: Informa0on Retrieval Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Last lecture index construc)on Sort- based indexing Naïve in- memory inversion Blocked Sort- Based Indexing Merge sort is effec)ve for

More information

IR System Components. Lecture 2: Data structures and Algorithms for Indexing. IR System Components. IR System Components

IR System Components. Lecture 2: Data structures and Algorithms for Indexing. IR System Components. IR System Components IR System Components Lecture 2: Data structures and Algorithms for Indexing Information Retrieval Computer Science Tripos Part II Document Collection Ronan Cummins 1 Natural Language and Information Processing

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

Web Page Similarity Searching Based on Web Content

Web Page Similarity Searching Based on Web Content Web Page Similarity Searching Based on Web Content Gregorius Satia Budhi Informatics Department Petra Chistian University Siwalankerto 121-131 Surabaya 60236, Indonesia (62-31) 2983455 greg@petra.ac.id

More information

Reuters collection example (approximate # s)

Reuters collection example (approximate # s) BSBI Reuters collection example (approximate # s) 800,000 documents from the Reuters news feed 200 terms per document 400,000 unique terms number of postings 100,000,000 BSBI Reuters collection example

More information

Document Representation : Quiz

Document Representation : Quiz Document Representation : Quiz Q1. In-memory Index construction faces following problems:. (A) Scaling problem (B) The optimal use of Hardware resources for scaling (C) Easily keep entire data into main

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Web Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti

Web Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti Web Information Retrieval Exercises Boolean query answering Prof. Luca Becchetti Material rif 3. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schueze, Introduction to Information Retrieval, Cambridge

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response

2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response CMSC 476/676 Review 1. Week 1 Overview of Information Retrieval a. Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

Lexical Analysis. Lecture 3-4

Lexical Analysis. Lecture 3-4 Lexical Analysis Lecture 3-4 Notes by G. Necula, with additions by P. Hilfinger Prof. Hilfinger CS 164 Lecture 3-4 1 Administrivia I suggest you start looking at Python (see link on class home page). Please

More information

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually

More information