Βαθμολόγηση, Διαβάθμιση όρων και το μοντέλο διανυσματικού χώρου

Size: px
Start display at page:

Download "Βαθμολόγηση, Διαβάθμιση όρων και το μοντέλο διανυσματικού χώρου"

Transcription

1 Βαθμολόγηση, Διαβάθμιση όρων και το μοντέλο διανυσματικού χώρου

2 Γιατί η διαβάθμιση; Μέχρι τώρα έχουμε δει ευρετήρια τα οποία υποστηρίζουν Μπούλειo αναζήτηση: ένα έγγραφο είτε ταιριάζει είτε δεν ταιριάζει με κάποια επερώτηση Σε μεγάλες συλλογές η Μπούλειος αναζήτηση μπορεί να μας δίνει τεράστιο πλήθος απαντήσεων.. Είναι λοιπόν σημαντικό να μπορούμε να ιεραρχούμε τις απαντήσεις, δίνοντας κάποιο βαθμό (score) στα ζεύγη (query,document) 2

3 This lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring Term weighting Vector space retrieval

4 Metadata Digital documents usually contain certain metadata associated with each document: data about a document, such as author, title, date of publication, abstract etc. Fields: metadata entities whose possible values are finite; e.g. the set of all dates of authorship, the set of all names etc. Zones: entities whose values are typically an arbitrary, unbounded amount of text e.g., document titles and abstracts 4

5 Metadata (ctd) Fields Language = French Format = pdf Subject = Physics etc. Date = Feb 2000 Values

6 Parametric search A parametric search interface allows the user to combine a full-text query with selections on field values e.g., language, date range, etc.

7 Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

8 Parametric search example We can add text search.

9 Parametric/field search In these examples, we select field values Values can be hierarchical, e.g., Geography: Continent Country State City A paradigm for navigating through the document collection, e.g., Aerospace companies in Brazil can be arrived at first by selecting Geography then Line of Business, or vice versa Filter docs in contention and run text searches scoped to subset

10 Index support for parametric search Must be able to support queries of the form Find pdf documents that contain stanford university A field selection (on doc format) and a phrase query Field selection use inverted index of: <field values docids> Organized by field name Use compression etc. as before

11 Parametric index support Optional provide richer search on field values e.g., wildcards Find books whose Author field contains s*trup Range search find docs authored between September and December Inverted index doesn t work (as well) Use techniques from database range search See for instance for a summary of B-trees Use query optimization heuristics as before

12 Zones A zone is an identified region within a doc E.g., Title, Abstract, Bibliography Generally culled from marked-up input or document metadata (e.g., powerpoint) Contents of a zone are free text Not a finite vocabulary Indexes for each zone - allow queries like sorting in Title AND smith in Bibliography AND recur* in Body Not queries like all papers whose authors cite themselves Why?

13 Zone indexes simple view Title Author Body

14 Zone indexes 14

15 So we have a database now? Not really. Databases do lots of things we don t need Transactions Recovery (our index is not the system of record; if it breaks, simply reconstruct from the original source) Indeed, we never have to store text in a search engine only indexes We re focusing on optimized indexes for textoriented queries, not an SQL engine.

16 Scoring and Ranking

17 Το ερώτημα Πώς μπορούμε να επιλέξουμε τις απαντήσεις μιας αναζήτησης, οι οποίες να είναι περισσότερο συναφείς με τις πληροφορίες που αναζητά ο χρήστης; 17

18 Scoring Thus far, our queries have all been Boolean Docs either match or not Good for expert users with precise understanding of their needs and the corpus Applications can consume 1000 s of results Not good for (the majority of) users with poor Boolean formulation of their needs Most users don t want to wade through 1000 s of results cf. use of web search engines

19 Scoring We wish to return in order the documents most likely to be useful to the searcher How can we rank order the docs in the corpus with respect to a query? Assign a score say in [0,1] for each doc on each query Begin with a perfect world no spammers Nobody stuffing keywords into a doc to make it match queries More on adversarial IR under web search

20 Weighted zone scoring (βαθμολόγηση ζώνης) Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q,d) a score in the interval [0,1]. To this end, it computes a linear combination of zone scores, where each zone of the document contributes a Boolean value to the combination. Assume a set of docs, each of which with l zones.... Di Di.z1 Di.z2 Di.zl q Si g1 x {0,1} + g2 x {0,1} gl x {0,1} 20

21 Formal definition Let Boolean query q and a document d with l zones l Let g1,...,gl [0,1], such that gi = 1 i=1 Let si (1 i l) be the Boolean score denoting a match (or absence thereof) between query q and the i-th zone of document d: si is a Boolean function that maps the presence of query terms in a zone to 0, 1 The weighted zone score is defined to be: l gi. si 1 21

22 Linear zone combinations First generation of scoring methods: use a linear combination of Booleans: E.g. Score = 0.6*<sorting in Title> + 0.3*<sorting in Abstract> *<sorting in Body> *<sorting in Boldface> Each expression such as <sorting in Title> takes on a value in {0,1}. Then the overall score is in [0,1]. For this example the scores can only take on a finite set of values what are they?

23 Linear zone combinations In fact, the expressions between <> on the last slide could be any Boolean query Who generates the Score expression (with weights such as 0.6 etc.)? In uncommon cases the user, in the UI Most commonly, a query parser that takes the user s Boolean query and runs it on the indexes for each zone

24 Exercise On the query bill OR rights suppose that we retrieve the following docs from the various zone indexes: Author Title Body bill rights bill rights bill rights Compute the score for each doc based on the weightings 0.6,0.3,0.1

25 General idea We are given a weight vector whose components sum up to 1. There is a weight for each zone/field. Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields. Typically users want to see the K highestscoring docs.

26 Index support for zone combinations In the simplest version we have a separate inverted index for each zone Variant: have a single index with a separate dictionary entry for each term and zone E.g., bill.author bill.title bill.body Of course, compress zone names like author/title/body.

27 Zone combinations index The above scheme is still wasteful: each term is potentially replicated for each zone In a slightly better scheme, we encode the zone in the postings: bill 1.author, 1.body 2.author, 2.body 3.title As before, the zone names get compressed. At query time, accumulate contributions to the total score of a document from the various postings, e.g.,

28 Score accumulation bill 1.author, 1.body 2.author, 2.body 3.title rights 3.title, 3.body 5.title, 5.body As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before. Note: we get both bill and rights in the Title field of doc 3, but score it no higher. Should we give more weight to more hits?

29 Algorithm for computing weighted score Compute weighted scores directly from inverted indexes assuming query q is a 2-term query consisting l of query terms q1 and q2. compute gi. si 1 29

30 Learning weights Εκμάθηση βαρών

31 Where do these weights come from? Machine learned relevance: a general class of approaches to scoring and ranking in information retrieval Given A test corpus A suite of test queries A set of relevance judgments: in its simplest form, a relevance judgment is either Relevant or Nonrelevant Learn a set of weights such that relevance judgments matched Can be formulated as ordinal regression

32 Learning for weighted zone scoring Problem formulation: Learning a linear function of the Boolean match scored contributed by the various zones. Expensive component: labor-intensive assembly of user-generated relevance judgments from which to learn the weights The learning problem can be reduced to a simple optimization problem 32

33 Learning weights Simple case of weighted zone scoring each document has a title and body zones Given a query q and a document d, use the Boolean match function to compute st(d,q) and sb(d,q), depending on whether the title/body zone of d matches q, and then: score(d,q) = g. st(d,q) + (1-g). sb(d,q) g is the parameter that we want to estimate, using a set of training examples Φj = (dj,qj,r(dj,qj)) where r(dj,qj) is the relevance judgment from a human editor 33

34 Training examples We are given training examples, each of which is a triple: DocID d, Query q and Judgment Relevant/Non. From these, we will learn the best value of w. 34

35 Learning weights How do we estimate g??? Given some g, the score for each Φj is: score(dj,qj) = g. st(dj, qj) + (1-g). sb(dj, qj) We want to find a value g, such that: Σ ε(g, Φ j ) is minimum, where: j ε(g, Φj) = (r(dj,qj) - score(dj,qj)) 2 35

36 Learning weights Let n01r / n01n denote the number of training examples for which: st(dj, qj)=0 sb(dj, qj)=1 the editorial judgment is Relevant / Non-relevant The contribution to the total error from training examples for which [st(dj, qj)=0, sb(dj, qj)=1] and the editorial judgment is [Relevant/Non-relevant], is: Taking into account all combinations of values, we get a total error of: 36

37 Learning weights This is a quadratic function f(g) How do we find its minimum? df(g)/dg = 0 37

38 Scoring: density-based Thus far: position and overlap of terms in a document - title, author, etc. Obvious next idea: if a document talks about a topic more, then it is a better match This applies even when we only have a single query term. Document relevant if it has many occurrences of the term(s) This leads to the idea of term weighting. 38

39 Term frequency and weighting

40 Weighted Zone Scoring for Boolean Queries Let Boolean query q and a document d with l zones si (1 i l) is a Boolean value denoting a match between q and the i-th zone of d: The weighted zone score is defined to be: score(q,d) = gi. si where g1,...,gl [0,1] and gi = 1 l 1 40

41 Term Frequency A document or zone that mentions a query term more often has more to do with that query therefore, it should receive a higher score Each term in a document can be assigned a weight that depends on the number of occurrences of the term in the document (term frequency, tff,d) WARNING: frequency is used to mean count Thus term frequency in IR literature is used to mean number of occurrences in a doc Not divided by document length (which would actually make it a frequency) 41

42 Bag of words Documents can be modeled as a bag of words the exact ordering of words is ignored we retain only information on the number of occurrences of each term we assume that two documents with the similar bag of words representation are similar in content Thus the doc John is quicker than Mary. is indistinguishable from the doc Mary is quicker than John. Which of the indexes discussed so far distinguish these two docs? 42

43 Which words count? For a free text query q=(q1,...,ql), all terms in q are equally important, when assessing relevancy of q to d However, certain terms have little or no discriminating power in determining relevance - they may appear to almost all documents of a collection 43

44 Which words count? Consider the ides of march query on Shakespear s works Julius Caesar has 5 occurrences of ides No other play has ides march occurs in over a dozen All the plays contain of By this scoring measure, the top-scoring play is likely to be the one with the most ofs 44

45 Term frequency tf Long docs are favored because they re more likely to contain query terms Can fix this to some extent by normalizing for document length But is raw tf the right measure?

46 Weighting term frequency: tf What is the relative importance of 0 vs. 1 occurrence of a term in a doc 1 vs. 2 occurrences 2 vs. 3 occurrences Unclear: while it seems that more is better, a lot isn t proportionally better than a few Can just use raw tf Another option commonly used in practice:

47 Score computation Score for a query q = sum over terms t in q: [Note: 0 if no query terms in document] This score can be zone-combined Can use wf instead of tf in the above Still doesn t consider term scarcity in collection (ides is rarer than of)

48 Weighting should depend on the term overall Which of these tells you more about a doc? 10 occurrences of hernia? 10 occurrences of the? Would like to attenuate the weight of a common term But what is common?

49 Metrics of common Collection frequency (cf ) = total number of occurrences of the term in the entire collection of documents Document frequency (df ) = number of docs in the corpus containing the term df is preferable over cf: Word cf df ferrari insurance the few docs that contain ferrari should get a higher boost for a query on ferrari than the docs that contain insurance get from a query on insurance 49

50 Inverse Document Frequency Inverse document frequency (idf ) measure of informativeness of a term: its rarity across the whole corpus could just be raw count of number of documents the term occurs in (idf i = 1/df i ) but by far the most commonly used version is: 50

51 tf x idf term weights Combine term frequency (or wf, or some measure of term density) and inverse document frequency, to produce a composite weight for each term in each document tf-idft,d = tft,d x idft tf-idf is: highest when t occurs many times within a small number of documents (lending high discriminating power to those docs) lower when the term occurs fewer times in a doc, or occurs in many documents (thus offering less pronounced relevance signal) lowest when the term occurs in virtually all documents

52 Summary: tf x idf (or tf.idf) Assign a tf.idf weight to each term i in each document d What is the wt of a term that occurs in all of the docs? Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus

53 Real-valued term vectors Still Bag of words model Each is a vector in R M Here log-scaled tf.idf 53

54 Άσκηση Θεωρείστε τον ακόλουθο πίνακα με τις συχνότητες όρων γιατρίααρχείαdoc1, Doc2, Doc3 Υπολογίστε τα βάρη tf-idf για τους όρους car, auto, insurance, best για κάθε αρχείο, χρησιμοποιώντας τις τιμές idf που δίνονται στον δεύτερο πίνακα tf

55 Τιμές tf-idf Doc1 Doc2 Doc3 car auto insuranc e best

56 Vector Space Model Each doc j can now be viewed as a vector of wf idf values, one component for each term So we have a vector space terms are axes docs live in this space even with stemming, may have 20,000+ dimensions We denote by V(d) the vector derived from document d, with one component of the vector for each dictionary term components are computed using the tf-idf weighting scheme

57 Why turn docs into vectors? First application: query by example Given a doc D, find others like it With D being expressed as a vector, we need to find other vectors (docs) near it Assumption: documents that are close together in the vector space, talk about the same things 57

58 Similarity metrics How do we quantify the similarity between two documents in the Vector Space model? Take the difference between the two vectors not a solution (why?) A-B -B A B 58

59 Cosine similarity Distance between vectors V(d1) and V(d 2 ) captured by the cosine of the angle θ between them. Note this is similarity, not distance No triangle inequality for similarity. t 3 d 2 θ d 1 t 1 t 2

60 Cosine similarity Mathematical definition: sim(d1,d2) = V(d1). V(d2) / V(d1). V(d2) Cosine of angle between two vectors Numerator: inner product of two vectors Denominator: product of Euclidean lengths of the vectors. The effect of the denominator is to normalize the length of the two vectors: v(d1) = V(d1)/ V(d1) is a unit vector Longer documents don t get more weight

61 Example Docs: Austen's Sense and Sensibility, Pride and Prejudice; Bronte's Wuthering Heights term frequencies cos(sas, PAP) =.996 x x x 0.0 = cos(sas, WH) =.996 x x x.254 = term vectors (taking into consideration only raw tf s and normalizing for only the 3 terms)

62 Using the cosine similarity When searching for the documents in a collection that are most similar to a document d: we reduce this problem to that of finding the documents di with the highest inner product v(di).v(d) 62

63 Term-document matrix Viewing a collection of N documents as a collection of vectors leads to a view of the collection as a MxN matrix: M rows: represent the M terms N columns: represent the N documents Each is a vector in R v Here log-scaled tf.idf

64 Queries as vectors Given a free text query q: view this as a bag of words, treating it as a very short document compute the query vector v(q) Answers to this query are documents of the collection that have a high cosine similarity value with v(q) In other words, we score documents as follows: score(q,d) = V(q). V(d) / V(q). V(d) 64

65 Computing vector scores Collection of documents, each represented by a vector Free text query, represented by a vector Positive integer K Seek K docs of the collection with the highest vector space scores on the given query Basic algorithm follows: Length[N]: array holding the lengths (normalization factors) for each of the N documents Scores[N]: scores for each document 65

66 Basic cosine score computation Calculate weight for query term t Where is this stored? Update score of all documents d that contain t, by adding its contribution to sim(q,d) Normalize

67 Notes on score computation Why not store normalized tf-idf value at each postings entry? idf the same for all postings entries in a single postings list (term) What about tf/normalization? Space blowup because of floats tf s stored as integers Lengths stored once per doc

68 Document & query weighting schemes Several variations of vector space scoring exist The SMART notation represents different variations with the form ddd.qqq ddd: encodes the term weighting of a doc vector qqq: encodes the weighting in the query vector term frequency document frequency normalization 68

69 Back-up slides

70 Full text queries We just scored the Boolean query bill OR rights Most users more likely to type bill rights or bill of rights How do we interpret these full text queries? No Boolean connectives Of several query terms some may be missing in a doc Only some query terms may occur in the title, etc.

71 Full text queries To use zone combinations for free text queries, we need A way of assigning a score to a pair <free text query, zone> Zero query terms in the zone should mean a zero score More query terms in the zone should mean a higher score Scores don t have to be Boolean Will look at some alternatives now

72 Incidence matrices Recall: Document (or a zone in it) is binary vector X in {0,1} v Query is a vector Score: Overlap measure:

73 Example On the query ides of march, Shakespeare s Julius Caesar has a score of 3 All other Shakespeare plays have a score of 2 (because they contain march) or 1 Thus in a rank order, Julius Caesar would come out tops

74 Overlap matching What s wrong with the overlap measure? It doesn t consider: Term frequency in document Term scarcity in collection (document mention frequency) of is more common than ides or march Length of documents (And queries: score not normalized)

75 Overlap matching One can normalize in various ways: Jaccard coefficient: Cosine measure: What documents would score best using Jaccard against a typical query? Does the cosine measure fix this problem?

76 Scoring: density-based Thus far: position and overlap of terms in a doc title, author etc. Obvious next: idea if a document talks about a topic more, then it is a better match This applies even when we only have a single query term. Document relevant if it has a lot of the terms This leads to the idea of term weighting.

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary

More information

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring This lecture: IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring 1 Ch. 6 Ranked retrieval Thus far, our queries have all

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Construc9on Sort- based indexing Blocked Sort- Based Indexing

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Compression Collec9on and vocabulary sta9s9cs: Heaps and

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression. Sec. 5.2 FRONT CODING Front-coding: Sorted words commonly have long common prefix store differences only (for last k-1 in a block of k) 8automata8automate9automatic10automation 8automat*a1 e2 ic3 ion Encodes

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 6-: Scoring, Term Weighting Outline Why ranked retrieval? Term frequency tf-idf weighting 2 Ranked retrieval Thus far, our queries have all been Boolean. Documents

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

Scoring, term weighting, and the vector space model

Scoring, term weighting, and the vector space model 6 Scoring, term weighting, and the vector space model Thus far, we have dealt with indexes that support Boolean queries: A document either matches or does not match a query. In the case of large document

More information

Models for Document & Query Representation. Ziawasch Abedjan

Models for Document & Query Representation. Ziawasch Abedjan Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Outline of the course

Outline of the course Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library

More information

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007 Information Retrieval Lecture 5 - The vector space model Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 28 Introduction Boolean model: all documents

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Boolean model and Inverted index Processing Boolean queries Why ranked retrieval? Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep

More information

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics

More information

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Spamming indices This was invented

More information

Chapter III.2: Basic ranking & evaluation measures

Chapter III.2: Basic ranking & evaluation measures Chapter III.2: Basic ranking & evaluation measures 1. TF-IDF and vector space model 1.1. Term frequency counting with TF-IDF 1.2. Documents and queries as vectors 2. Evaluating IR results 2.1. Evaluation

More information

Q: Given a set of keywords how can we return relevant documents quickly?

Q: Given a set of keywords how can we return relevant documents quickly? Keyword Search Traditional B+index is good for answering 1-dimensional range or point query Q: What about keyword search? Geo-spatial queries? Q: Documents on Computer Science? Q: Nearby coffee shops?

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Data Modelling and Multimedia Databases M

Data Modelling and Multimedia Databases M ALMA MATER STUDIORUM - UNIERSITÀ DI BOLOGNA Data Modelling and Multimedia Databases M International Second cycle degree programme (LM) in Digital Humanities and Digital Knoledge (DHDK) University of Bologna

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 8: A complete search system Scoring and results assembly 8 Ch. 6 Last Time: @- idf weighdng The @- idf weight of a term is the product

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Ranking and Learning. Table of Content. Weighted scoring for ranking Learning to rank: A simple example Learning to ranking as classification.

Ranking and Learning. Table of Content. Weighted scoring for ranking Learning to rank: A simple example Learning to ranking as classification. Table of Content anking and Learning Weighted scoring for ranking Learning to rank: A simple example Learning to ranking as classification 290 UCSB, Tao Yang, 2013 Partially based on Manning, aghavan,

More information

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML CS276B Text Retrieval and Mining Winter 2005 Plan for today Vector space approaches to XML retrieval Evaluating text-centric retrieval Lecture 15 Text-centric XML retrieval Documents marked up as XML E.g.,

More information

Lecture 8 May 7, Prabhakar Raghavan

Lecture 8 May 7, Prabhakar Raghavan Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Classic IR Models 5/6/2012 1

Classic IR Models 5/6/2012 1 Classic IR Models 5/6/2012 1 Classic IR Models Idea Each document is represented by index terms. An index term is basically a (word) whose semantics give meaning to the document. Not all index terms are

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

CS105 Introduction to Information Retrieval

CS105 Introduction to Information Retrieval CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

The Web document collection

The Web document collection Web Data Management Part 1 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Foundations of Computing

Foundations of Computing Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/

More information

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

Midterm Exam Search Engines ( / ) October 20, 2015

Midterm Exam Search Engines ( / ) October 20, 2015 Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Part 2: Boolean Retrieval Francesco Ricci

Part 2: Boolean Retrieval Francesco Ricci Part 2: Boolean Retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content p Term document matrix p Information

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval Outline ❶ Course details ❷ Information retrieval ❸ Boolean retrieval 2 Course details

More information

Vector Space Models. Jesse Anderton

Vector Space Models. Jesse Anderton Vector Space Models Jesse Anderton CS6200: Information Retrieval In the first module, we introduced Vector Space Models as an alternative to Boolean Retrieval. This module discusses VSMs in a lot more

More information

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Session 10: Information Retrieval

Session 10: Information Retrieval INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 2: Boolean retrieval 2 Blanks on slides, you may want to fill in Last Time: Ngram Language Models Unigram LM: Bag of words Ngram

More information

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

Indexing and Query Processing. What will we cover?

Indexing and Query Processing. What will we cover? Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Introduction to Computational Advertising. MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski

Introduction to Computational Advertising. MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski Introduction to Computational Advertising MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski 1 Lecture 4: Sponsored Search (part 2) 2 Disclaimers This talk presents

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

Notes: Notes: Primo Ranking Customization

Notes: Notes: Primo Ranking Customization Primo Ranking Customization Hello, and welcome to today s lesson entitled Ranking Customization in Primo. Like most search engines, Primo aims to present results in descending order of relevance, with

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Relevance of a Document to a Query

Relevance of a Document to a Query Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document

More information

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Boolean Retrieval Weighted Boolean Retrieval Zone Indices

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

Text Retrieval an introduction

Text Retrieval an introduction Text Retrieval an introduction Michalis Vazirgiannis Nov. 2012 Outline Document collection preprocessing Feature Selection Indexing Query processing & Ranking Text representation for Information Retrieval

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview

More information