Information Retrieval. Lecture 2 - Building an index
- Agnes Patrick
1 Information Retrieval. Lecture 2 - Building an index
Seminar für Sprachwissenschaft, International Studies in Computational Linguistics, winter semester
2 Overview
- Introduction
- Boolean model revisited
- Building the vocabulary
- Building an index
3 Introduction
Last time:
- terminology (inverted indexes, dictionary, posting lists, query, etc.)
- conceptual model for IR (information need, query, retrieval of relevant documents)
- the boolean model (building an index and applying queries)
Today:
- the boolean model revisited (processing phrase queries)
- how to build a dictionary, i.e. how to extract the keywords and the corresponding postings from the data
4 Part 1. Boolean model (continued)
5 Boolean model revisited. Recall - the boolean model
- an inverted index associates keywords with posting lists
- the posting lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.)
- boolean queries are processed by merging posting lists in order to find the documents satisfying the query
- the cost of this list merging is linear in the total number of document IDs: O(m + n), where m and n are the lengths of the two lists being merged
- question: how to process phrase queries (i.e. taking the word's context into account)?
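The linear-time merge recalled above can be sketched as follows. This is a minimal illustration for an AND query over two sorted posting lists; the function name and the sample lists are assumptions, not taken from the slides.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted posting lists in O(m + n) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms: keep it and advance both lists.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document ID
        else:
            j += 1
    return answer


# Hypothetical posting lists for two keywords.
list_a = [1, 2, 4, 11, 31, 45, 173, 174]
list_b = [2, 31, 54, 101]
print(intersect(list_a, list_b))  # [2, 31]
```

Each step advances at least one pointer, which is where the O(m + n) bound comes from.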
6 Boolean model revisited. Processing phrase queries (1 / 3)
- queries where the context of the keyword matters, examples: "Seminar für Sprachwissenschaft", "graph theory", "artificial intelligence", ...
- the user wants documents where the whole phrase appears, and not only some parts of it (i.e. "ich studiere Sprachwissenschaft" ("I study linguistics") is not a match for the first example)
- about 10 % of web queries are phrase queries (e.g. song names, institutions, etc.)
- such queries need either more complex dictionary terms, or more complex postings (critical parameter: size of the index)
7 Boolean model revisited. Processing phrase queries (2 / 3)
A) convert phrase queries into conjunctions of biwords
- use key-phrases of length 2, example:
  Phrase: Gottlieb Daimler Stadion
  Dictionary: (a) Gottlieb Daimler (b) Daimler Stadion
- the dictionary is made of biwords (notion of context)
- NB: the dictionary gets bigger than with single keywords
- may give false positives (a document can contain every biword of the query without containing the whole phrase)
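A small sketch of the biword approach, assuming whitespace tokenization and lowercasing (both simplifications). The two sample documents are invented to show how a false positive arises.

```python
from collections import defaultdict


def biwords(text: str) -> list[str]:
    """Return the consecutive word pairs of a text."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each biword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index: dict[str, set[int]], phrase: str) -> set[int]:
    """Answer a phrase query as a conjunction of its biwords."""
    parts = biwords(phrase)
    if not parts:
        return set()
    result = set(index.get(parts[0], set()))
    for bw in parts[1:]:
        result &= index.get(bw, set())
    return result


# Hypothetical documents: document 2 contains both biwords, but not the phrase.
docs = {
    1: "Gottlieb Daimler Stadion in Stuttgart",
    2: "Gottlieb Daimler built engines and the Daimler Stadion was named after him",
}
index = build_biword_index(docs)
print(sorted(phrase_query(index, "Gottlieb Daimler Stadion")))  # [1, 2]
```

Document 2 is returned although the three words never occur consecutively in it, which is the false-positive case mentioned on the slide.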
8 Boolean model revisited. Processing phrase queries (3 / 3)
B) store positions in the inverted indexes, example:
  termid ::= doc1: position1, position2, ...
             doc2: position1, position2, ...
- processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists)
- NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words)
9 Boolean model revisited. Example
Which documents can contain the sentence "to be or not to be", considering the following (incomplete) indexes?
  be ::= 1: 7, 18, 33, 72, 86, 231
         2: 3, 149
         4: 17, 191, 291, 430, 434
         5: 363, 367
  to ::= 2: 1, 17, 74, 222, 551
         4: 8, 16, 190, 429, 433
         7: 13, 23, 191
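The example can be checked mechanically. The sketch below uses only the two posting lists given on the slide; since the lists for "or" and "not" are not shown, it tests the necessary condition that "to" occurs at positions p and p+4 and "be" at positions p+1 and p+5. The function name and the pattern encoding are assumptions.

```python
# Positional postings from the slide: term -> {doc_id: [positions]}.
index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
}


def candidate_docs(index, pattern):
    """pattern is a list of (term, offset) pairs relative to the phrase start.

    Returns {doc_id: [start positions]} for documents where every known term
    occurs at the required offset.
    """
    terms = {term for term, _ in pattern}
    # Only documents containing all known terms can match.
    docs = set.intersection(*(set(index[t]) for t in terms))
    hits = {}
    first_term, first_offset = pattern[0]
    for doc in sorted(docs):
        starts = []
        for pos in index[first_term][doc]:
            start = pos - first_offset
            if all(start + off in index[t][doc] for t, off in pattern):
                starts.append(start)
        if starts:
            hits[doc] = starts
    return hits


# "to be or not to be": to@0, be@1, (or@2, not@3 unknown), to@4, be@5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
print(candidate_docs(index, pattern))  # {4: [429]}
```

Only document 4 qualifies, with the phrase starting at position 429 (to 429, be 430, to 433, be 434).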
10 Boolean model revisited. Size of positional indexes
- positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document ID)
- the size of such an index therefore grows with the size of the documents, and not only with their number
- the size of a positional index depends on the language being indexed and the type of document (books, articles, etc.)
- on average, a positional index is 2-4 times bigger than a non-positional index; it can reach 35 to 50 % of the size of the original text (for English)
- positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al, 2005])
11 Part 2. Building the vocabulary
12 Building the vocabulary
1. Inter-document parsing
2. Intra-document parsing
- Tokenization
- Keyword selection
- Normalization
- Stemming / Lemmatization
13 Building the vocabulary. Parsing the data (1 / 3)
1. Inter-document parsing: processing the raw data to produce a usable collection of documents
By usable, we mean that each document contains only a sequence of characters
Problems:
- document's format (pdf, doc, html, etc.)
- document's encoding (ISO, UTF-8)
- document's language (automatic recognition)
- document unit (example: mbox vs Maildir)
14 Building the vocabulary. Parsing the data (2 / 3)
2. Intra-document parsing: processing a stream of characters to extract keywords
1st task: tokenization, main difficulties:
- token delimiters (ex: Chinese)
- apostrophes (ex: O'Neill, Finland's capital)
- hyphens (ex: Hewlett-Packard, state-of-the-art)
- segmented compound nouns (ex: Los Angeles)
- unsegmented compound nouns (ex: Sonderforschungsbereich)
- numerical data (dates, IP addresses)
- word order (ex: Arabic with respect to nouns and numbers)
15 Building the vocabulary. Parsing the data (3 / 3)
Solutions for tokenization issues:
(a) using a pre-defined dictionary with largest matches and heuristics for unknown words
(b) using learning algorithms trained over hand-segmented words
16 Building the vocabulary. Choosing keywords
- selecting the words that are most likely to appear in a query
- these words characterize the documents they appear in
- what about the other words (i.e. noise words)?
- words that do not carry informative content with respect to queries:
  - high frequency terms (the, a, is, etc.)
  - words with little semantic content, such as function words (if, of, to, etc.)
- question: can these words be determined automatically to compute a stop list (i.e. words that will be discarded during indexing)?
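Solution (a) can be illustrated with a greedy largest-match segmenter. This is a sketch under simple assumptions: the lexicon is a toy set invented for the example, and the heuristic for unknown material is simply to emit a single character and move on.

```python
def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    max_len = max(map(len, dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fallback heuristic: an unknown single character becomes a token.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical lexicon for an unsegmented German compound.
lexicon = {"sonder", "forschung", "forschungs", "bereich"}
print(max_match("sonderforschungsbereich", lexicon))
# ['sonder', 'forschungs', 'bereich']
```

Greedy matching can go wrong when a long dictionary entry swallows the beginning of the next word, which is one reason the slide also mentions learned segmenters.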
20 Building the vocabulary. Stop list
How to build a stop list?
- compute the most frequent terms (either as a preprocessing step or from other corpora with an equivalent distribution)
- be careful, some words like home, life or water are among the 200 most frequently used terms in English [Fox, 1992]
- sort these words by frequency, and apply a semantic filter according to the domain of discourse
- NB: some common terms are needed for phrase queries (song names, relation queries, etc.)
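The procedure above (count, sort by frequency, filter) can be sketched as follows. The `keep` argument stands in for the semantic filter and is an assumption of this example, as are the sample documents.

```python
from collections import Counter


def build_stop_list(docs: list[str], k: int, keep: frozenset = frozenset()) -> list[str]:
    """Return the k most frequent terms, skipping terms protected by `keep`."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    by_frequency = [term for term, _ in counts.most_common()]
    return [term for term in by_frequency if term not in keep][:k]


# Hypothetical mini-corpus; "water" and "home" are frequent but meaningful.
docs = [
    "the water is in the home",
    "the life of the water is",
    "the home is the place",
]
domain_terms = frozenset({"water", "home", "life"})
print(build_stop_list(docs, k=2, keep=domain_terms))  # ['the', 'is']
```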
21 Building the vocabulary. Normalization of tokens
- retrieval needs normalized data
- examples: acronyms (USA vs U.S.A.), dates (22/10/2007 vs 10/22/2007 vs 2007/10/22), diacritics (Tübingen vs Tuebingen), abbreviations (ca. vs circa), typography (university vs University), etc.
- idea: using equivalence classes of terms, ex: { Opel, OPEL, opel } → opel
- alternative: expanding queries to all members of a class (efficiency issues)
- NB: documents and queries have to be processed using the same tokenization process!
22 Building the vocabulary. Stemming and lemmatization
- role: reducing inflectional forms to common base forms, example:
  car, cars, car's, cars' → car
  am, are, is → be
- stemming removes suffixes (surface markers) to produce root forms
- lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser)
- illustration of the difficulty:
  - plurals (woman/women, crisis/crisi?)
  - derivational morphology (automatize/automate)
- for English: the Porter stemming algorithm (University of Cambridge, UK, 1980)
23 Building the vocabulary. Porter stemmer (1 / 3)
- algorithm based on a set of context-sensitive rewriting rules
  martin/porterstemmer/index.html
  martin/porterstemmer/def.txt
- rules are composed of a pattern (left-hand side) and a string (right-hand side), example:
  (.*)sses → \1ss
  (.*[aeiou].*)ed → \1
  (.*[aeiou].*)y → \1i
- rules may be constrained by conditions on the word's measure, example:
  (m > 1) (.*)ement → \1
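The three unconditioned example rules can be tried directly as regular-expression rewrites. This sketch applies the first rule whose pattern matches; it illustrates the rule format only and ignores the ordering and step structure of the actual Porter algorithm.

```python
import re

# (pattern, replacement) pairs transcribed from the example rules above.
RULES = [
    (re.compile(r"^(.*)sses$"), r"\1ss"),
    (re.compile(r"^(.*[aeiou].*)ed$"), r"\1"),
    (re.compile(r"^(.*[aeiou].*)y$"), r"\1i"),
]


def apply_first_matching_rule(word: str) -> str:
    """Rewrite the word with the first rule whose left-hand side matches."""
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word


for w in ["caresses", "plastered", "bled", "happy", "sky"]:
    print(w, "->", apply_first_matching_rule(w))
# caresses -> caress, plastered -> plaster, bled -> bled,
# happy -> happi, sky -> sky
```

"bled" and "sky" are left untouched because the part before the suffix contains no vowel, which is the context condition encoded by `[aeiou]` in the pattern.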
24 Building the vocabulary. Porter stemmer (2 / 3)
- provided a list of consonants is denoted by C, and a list of vowels by V, any word, or part of a word, has one of the four forms:
  CVCV ... C
  CVCV ... V
  VCVC ... C
  VCVC ... V
- these may all be represented by the single form (m is the measure): [C](VC){m}[V]
- examples:
  m=0: TR, EE, TREE, Y, BY
  m=1: TROUBLE, OATS, TREES, IVY
  m=2: TROUBLES, PRIVATE, OATEN, ORRERY
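The measure m can be computed by classifying each letter and counting VC transitions. The sketch below assumes Porter's convention that "y" counts as a vowel when it follows a consonant and as a consonant otherwise; the helper `strip_ement` is a hypothetical illustration of one measure-conditioned rule, (m > 1) ement → (empty).

```python
import re


def measure(word: str) -> int:
    """Number of VC sequences in the form [C](VC){m}[V]."""
    kinds = []  # one "V" or "C" per letter
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            kinds.append("V")
        elif ch == "y" and i > 0 and kinds[i - 1] == "C":
            kinds.append("V")  # y after a consonant behaves like a vowel
        else:
            kinds.append("C")
    # Collapse runs (CCVVC -> CVC), then count the VC pairs.
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(kinds))
    return collapsed.count("VC")


def strip_ement(word: str) -> str:
    """Apply (m > 1) ement -> '' : strip only if the remaining stem has m > 1."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


for w in ["tr", "tree", "by", "trouble", "ivy", "troubles", "orrery"]:
    print(w, measure(w))
print(strip_ement("replacement"), strip_ement("cement"))  # replac cement
```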
25 Building the vocabulary. Porter stemmer (3 / 3)
- sequential application of these reduction rules within 5 phases (total of 60 rules):
  - Step 1 deals with plurals and past participles
  - Steps 2 to 5 deal with English-specific suffixes
- within a phase, in case of ambiguity, the longest suffix match is preferred
- once a rule is chosen, its success or failure has no influence on the selection of another rule
- suffix stripping of a vocabulary of 10,000 words:
  Number of words reduced in step 1: 3597
  step 2: 766
  step 3: 327
  step 4: 2424
  step 5: 1373
  Number of words not reduced: [...]
26 Building the vocabulary. About existing stemmers (1 / 2)
Julie Beth Lovins' stemmer:
  stemming/general/lovins.htm
- developed at MIT (US) in 1968
- single-pass algorithm, context sensitive, iterative longest-match, 294 ending patterns
- uses a recoding phase to deal with spelling ambiguity
27 Building the vocabulary. About existing stemmers (2 / 2)
Paice/Husk stemmer:
  stemming/links/paice.htm
- developed at Lancaster University (UK) in 1990
- single table of rules, specifying the removal or replacement of an ending
- indefinite number of stages, rules are indexed by the last letter of the ending, aggressive suffix removal
28 Building the vocabulary. About stemming
- dangers of stemming:
  - information retrieval vs. information on Golden Retrievers
  - gravity vs gravitation
  (examples from Amit Singhal and Richard Belew, respectively)
- usefulness of stemming? compressed index, vocabulary reduced by 10 to 50 % (fast processing) vs inconvenience due to poor retrieval (loss of the meaning brought by the context)
- benefits of stemming depend on the language (it suits languages with rich word inflection better)
29 Part 3. Building an index
30 Building an index
1. Block-merge indexing
2. Single-pass indexing
3. Distributed indexing
4. Dynamic indexing
31 Building an index. Efficiency considerations (1 / 2)
- indexing algorithms depend on hardware characteristics
- accessing data in memory is faster than on disk
- moving the disk head to a non-contiguous area is time-consuming
- operating systems read and write blocks of bytes: reading a single byte needs the same time as reading the whole block (common block size: 2^n KB)
- some figures (2007 standards):
  - a disk seek takes [...] s
  - a block transfer from disk takes 10^-7 s per byte
  - a processor operation takes 10^-7 s
  - a processor's clock cycle is 10^-9 s (avg. 1 GHz)
32 Building an index. Efficiency considerations (2 / 2)
To give an idea: example of the REUTERS collection (Aug. 96 - Aug. 97):
  statistic                                         value
  documents                                         800,000
  avg. # word tokens per document                   200
  word types                                        400,000
  avg. # bytes per token (incl. spaces/punct.)      6
  avg. # bytes per token (without spaces/punct.)    4.5
  avg. # bytes per word type                        7.5
  non-positional postings                           100,000,000
Each posting entry is encoded using 12 bytes (4+4+4: term, doc, freq)
33 Building an index. Recall: indexing
- an index can be built by assembling all postings (term, docid) via a first pass through the collection
- then postings are sorted according to their term and docid
- finally, postings belonging to a given term are compacted into a posting list, and statistics about terms are computed
- NB: for efficiency reasons, a term can be represented by a termid
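The three steps recalled above (collect, sort, compact) fit in a few lines when everything is held in memory. This is a sketch with whitespace tokenization; the statistics kept per term (`df` for document frequency, `cf` for collection frequency) and the sample documents are assumptions of the example.

```python
from itertools import groupby


def build_index(docs: dict[int, str]) -> dict[str, dict]:
    # Pass 1: assemble all (term, docid) postings.
    postings = [(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.lower().split()]
    # Sort by term, then by docid.
    postings.sort()
    # Compact the postings of each term into one posting list with statistics.
    index = {}
    for term, group in groupby(postings, key=lambda p: p[0]):
        doc_ids, occurrences = [], 0
        for _, doc_id in group:
            occurrences += 1
            if not doc_ids or doc_ids[-1] != doc_id:
                doc_ids.append(doc_id)
        index[term] = {"df": len(doc_ids), "cf": occurrences, "postings": doc_ids}
    return index


docs = {1: "to be or not to be", 2: "to do"}
print(build_index(docs)["to"])  # {'df': 2, 'cf': 3, 'postings': [1, 2]}
```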
34 Building an index. Block-merge indexing (1 / 3)
- the list of posting entries may be huge and not fit into memory
- to avoid this, one uses an external sorting algorithm
- note that disk seeks have to be minimized
- principles of the block merge algorithm:
  (a) split the collection into parts of equal size,
  (b) sort the postings (termid, docid) corresponding to a part of the collection in memory,
  (c) store the intermediate result on disk, and
  (d) merge all intermediate results to produce the final index
35 Building an index. Block-merge indexing (2 / 3)
Block merge algorithm (from [Manning et al, 07]):
  blockmerge(collection c)
    n <- 0
    do
      n <- n + 1
      block <- parsenextblock(c)
      invert(block)
      writetodisc(block, f_n)
    while (c != [])
    return merge([f_1 .. f_n])
NB: merging needs to know the term-termid mapping.
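A runnable sketch of the same idea, assuming the postings arrive as (termid, docid) pairs and that a block is simply a fixed number of postings. Block files are plain text in a temporary directory; the file naming and the helper names are assumptions, and `heapq.merge` plays the role of the final merge step.

```python
import heapq
import os
import tempfile


def write_block(postings, directory, n):
    """Sort one block in memory and write it to disk, one posting per line."""
    path = os.path.join(directory, f"block{n}.txt")
    with open(path, "w") as f:
        for term_id, doc_id in sorted(set(postings)):
            f.write(f"{term_id} {doc_id}\n")
    return path


def read_block(path):
    """Stream the sorted postings of one block file."""
    with open(path) as f:
        for line in f:
            term_id, doc_id = line.split()
            yield int(term_id), int(doc_id)


def block_merge(posting_stream, block_size, directory):
    paths, block = [], []
    for posting in posting_stream:
        block.append(posting)
        if len(block) == block_size:
            paths.append(write_block(block, directory, len(paths) + 1))
            block = []
    if block:  # last, possibly smaller, block
        paths.append(write_block(block, directory, len(paths) + 1))
    # Merge the sorted runs sequentially, compacting postings per termid.
    index = {}
    for term_id, doc_id in heapq.merge(*(read_block(p) for p in paths)):
        plist = index.setdefault(term_id, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    return index


# Hypothetical (termid, docid) postings, split into blocks of 3.
postings = [(2, 1), (1, 1), (3, 1), (1, 2), (2, 2), (1, 3), (3, 3)]
with tempfile.TemporaryDirectory() as tmp:
    index = block_merge(postings, block_size=3, directory=tmp)
print(index)  # {1: [1, 2, 3], 2: [1, 2], 3: [1, 3]}
```

Each block file is read once and sequentially during the merge, which keeps the number of disk seeks low.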
36 Building an index. Block-merge indexing (3 / 3)
Disk transfer time during merge with the REUTERS sample ([...] postings):
  64 blocks × [...] entries × 12 bytes × b × 2 transfers ≈ 4.1 minutes
where b = 10^-7 seconds/byte is the byte transfer rate.
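The estimate can be reproduced as a quick calculation. The 100,000,000 postings and the 12 bytes per posting come from the REUTERS statistics given earlier; the figure of 1,600,000 entries per block is an assumption (100,000,000 / 64 rounded up) chosen because it reproduces the 4.1 minutes quoted above.

```python
BYTES_PER_POSTING = 12   # 4 + 4 + 4 bytes (term, doc, freq)
B = 1e-7                 # seconds per byte transferred
TRANSFERS = 2            # each posting is read once and written once


def merge_transfer_seconds(n_postings: int) -> float:
    """Disk transfer time for merging n_postings postings."""
    return n_postings * BYTES_PER_POSTING * B * TRANSFERS


exact = merge_transfer_seconds(100_000_000)
rounded_blocks = merge_transfer_seconds(64 * 1_600_000)  # assumed block size
print(round(exact / 60, 1), round(rounded_blocks / 60, 1))  # 4.0 4.1
```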
37 Building an index. Single-pass indexing (1 / 2)
- limit of the block merge method: the term-termid mapping may not fit into memory
- idea: create a dictionary for each block, and store this dictionary on disk (NB: the dictionary contains the terms themselves)
- a final step merges the dictionaries and postings
- principles of the single-pass indexing algorithm:
  (a) split the collection,
  (b) create a dictionary for the current split,
  (c) for each token, if it belongs to the current dictionary, retrieve its postings; if not, create a new empty posting list for it, and add a new entry to the dictionary,
  (d) add the current token to the posting list,
  (e) sort the terms, and
  (f) write the block and dictionary to disk
38 Building an index. Single-pass indexing (2 / 2)
Single-pass indexing algorithm (from [Manning et al, 07]):
  invert(stream c)
    output <- new File()
    dictionary <- new Hash()
    while (size(dictionary) < sizemax) do
      tok <- next(c)
      if not (term(tok) in dictionary)
        then p_list <- addtodict(dictionary, term(tok))
        else p_list <- getfromdict(dictionary, term(tok))
      if full(p_list)
        then p_list <- doublesize(p_list)
      addtolist(p_list, docid(tok))
    endwhile
    sorted <- sortterms(dictionary)
    writetodisc(sorted, dictionary, output)
    return output
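A Python sketch of single-pass indexing, under simplifying assumptions: the memory limit is modelled as a maximum number of tokens per block, blocks are kept in memory instead of being written to disk, and Python lists grow on their own, so the explicit doubling step disappears. Function and parameter names are hypothetical.

```python
import heapq


def spimi_invert(token_stream, max_tokens):
    """Build one block: a per-block dictionary from term to posting list."""
    dictionary = {}
    consumed = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # new entry if unseen term
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        consumed += 1
        if consumed >= max_tokens:  # stand-in for "memory is full"
            break
    return sorted(dictionary.items())  # terms sorted before "writing" the block


def spimi_index(tokens, max_tokens):
    stream = iter(tokens)  # shared iterator: each block continues where the last stopped
    blocks = []
    while True:
        block = spimi_invert(stream, max_tokens)
        if not block:
            break
        blocks.append(block)
    # Final step: merge the per-block dictionaries and postings by term.
    merged = {}
    for term, postings in heapq.merge(*blocks, key=lambda entry: entry[0]):
        merged.setdefault(term, set()).update(postings)
    return {term: sorted(ids) for term, ids in merged.items()}


# Hypothetical (term, docid) token stream.
tokens = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1),
          ("to", 2), ("do", 2)]
print(spimi_index(tokens, max_tokens=3))
# {'be': [1], 'do': [2], 'not': [1], 'or': [1], 'to': [1, 2]}
```

Because each block carries its own dictionary keyed by the term string, no global term-termid mapping has to be held in memory.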
39 Building an index. Distributed indexing (1 / 4)
- huge collections → the index cannot be computed on a single machine; it is then partitioned across several machines
- we can choose to partition either the postings or the keywords (document-partition versus term-partition)
- in all cases, the term-termid mapping must be consistent (e.g. precomputed)
- general idea: using a cluster of machines, each representing a node that performs a sub-task
- a node may crash; in that case the task it has been given is reallocated to another node
- a master node manages these task allocations (e.g. robustness, synchronization)
40 Building an index. Distributed indexing (2 / 4)
General architecture for indexing: MapReduce
- the collection is split (the size of each fragment is computed for efficiency, as a trade-off between flexibility and read/write access time)
- the Map phase applies to each split a parser which computes the posting lists contained in the split
- the output of a parser is stored in local intermediate files called segment files
- each segment file contains the postings for some terms (cf. term-partition)
- the Reduce phase assigns an inverter to each term-partition; this inverter collects the corresponding postings within the segment files
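The data flow of the two phases can be simulated sequentially on one machine. This sketch leaves out the master node, task reallocation and real files; the two-way term partition (a-m / n-z) and all function names are assumptions made for the illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Parser: turn one split, a list of (docid, text), into (term, docid) pairs."""
    return [(term, doc_id) for doc_id, text in split for term in text.lower().split()]


def partition(term):
    """Hypothetical term partition: 0 for terms starting a-m, 1 for n-z."""
    return 0 if term[0] < "n" else 1


def write_segments(pairs):
    """Group a parser's output into one segment 'file' per term partition."""
    segments = defaultdict(list)
    for term, doc_id in pairs:
        segments[partition(term)].append((term, doc_id))
    return segments


def reduce_phase(segment_files):
    """Inverter: collect the postings of one term partition from all segments."""
    index = defaultdict(set)
    for segment in segment_files:
        for term, doc_id in segment:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Two splits, handled by two (simulated) parsers.
splits = [
    [(1, "caesar came"), (2, "caesar conquered")],
    [(3, "rome was conquered")],
]
parser_outputs = [write_segments(map_phase(s)) for s in splits]
# One (simulated) inverter per term partition.
index_parts = [reduce_phase([out.get(p, []) for out in parser_outputs])
               for p in range(2)]
print(index_parts)
# [{'caesar': [1, 2], 'came': [1], 'conquered': [2, 3]}, {'rome': [3], 'was': [3]}]
```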
41 Building an index. Distributed indexing (3 / 4)
J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters (2004)
  mapreduce-osdi04.pdf
42 Building an index. Distributed indexing (4 / 4)
Remarks:
- the number of partitions is a parameter of the indexing system
- the posting list associated with a single term is supposed to fit in a single machine's memory
- parsers and inverters are not separate machines: according to their availability, the master node assigns a machine either a parsing or an inverting task
- for efficiency reasons, the network traffic is reduced as much as possible
43 Building an index
To be continued...
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationIntroduction to Information Retrieval
Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 4: Index construction Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationOutline of the course
Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library
More informationCS276 Information Retrieval and Web Search. Lecture 2: Dictionary and Postings
CS276 Information Retrieval and Web Search Lecture 2: Dictionary and Postings Recap of the previous lecture Basic inverted indexes: Structure: Dictionary and Postings Key step in construction: Sorting
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 4: Indexing April 27, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Recap: Inverted Indexes
More informationData-analysis and Retrieval Boolean retrieval, posting lists and dictionaries
Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationIN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)
IN4325 Indexing and query processing Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for
More informationCourse work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?
Course work Introduc)on to Informa(on Retrieval Problem set 1 due Thursday Programming exercise 1 will be handed out today CS276: Informa)on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationCOMP6237 Data Mining Searching and Ranking
COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001
More informationCS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Last lecture index construc)on Sort- based indexing Naïve in- memory inversion Blocked Sort- Based Indexing Merge sort is effec)ve for
More informationIR System Components. Lecture 2: Data structures and Algorithms for Indexing. IR System Components. IR System Components
IR System Components Lecture 2: Data structures and Algorithms for Indexing Information Retrieval Computer Science Tripos Part II Document Collection Ronan Cummins 1 Natural Language and Information Processing
More informationLogistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search
CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects
More informationWeb Page Similarity Searching Based on Web Content
Web Page Similarity Searching Based on Web Content Gregorius Satia Budhi Informatics Department Petra Chistian University Siwalankerto 121-131 Surabaya 60236, Indonesia (62-31) 2983455 greg@petra.ac.id
More informationReuters collection example (approximate # s)
BSBI Reuters collection example (approximate # s) 800,000 documents from the Reuters news feed 200 terms per document 400,000 unique terms number of postings 100,000,000 BSBI Reuters collection example
More informationDocument Representation : Quiz
Document Representation : Quiz Q1. In-memory Index construction faces following problems:. (A) Scaling problem (B) The optimal use of Hardware resources for scaling (C) Easily keep entire data into main
More informationText Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering
Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationWeb Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti
Web Information Retrieval Exercises Boolean query answering Prof. Luca Becchetti Material rif 3. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schueze, Introduction to Information Retrieval, Cambridge
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου
Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs
More information2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response
CMSC 476/676 Review 1. Week 1 Overview of Information Retrieval a. Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More informationLexical Analysis. Lecture 3-4
Lexical Analysis Lecture 3-4 Notes by G. Necula, with additions by P. Hilfinger Prof. Hilfinger CS 164 Lecture 3-4 1 Administrivia I suggest you start looking at Python (see link on class home page). Please
More informationIntroducing Information Retrieval and Web Search. borrowing from: Pandu Nayak
Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually
More information