Scoring, Term Weighting, and the Vector Space Model
- Oswin McDowell
1 Scoring, Term Weighting, and the Vector Space Model
2 Why ranking? So far we have seen indexes that support Boolean search: a document either matches a given query or it does not. On large collections, Boolean search can return an enormous number of answers. It is therefore important to be able to rank the answers, assigning a score to each (query, document) pair
3 This lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring Term weighting Vector space retrieval
4 Metadata Digital documents usually contain certain metadata associated with each document: data about a document, such as author, title, date of publication, abstract, etc. Fields: metadata entities whose possible values are finite, e.g., the set of all dates of authorship, the set of all names, etc. Zones: entities whose values are typically an arbitrary, unbounded amount of text, e.g., document titles and abstracts
5 Metadata (ctd.) Fields and their values, e.g., Language = French, Format = pdf, Subject = Physics, Date = Feb 2000
6 Parametric search A parametric search interface allows the user to combine a full-text query with selections on field values e.g., language, date range, etc.
7 Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
8 Parametric search example We can add text search.
9 Parametric/field search In these examples, we select field values. Values can be hierarchical, e.g., Geography: Continent / Country / State / City. A paradigm for navigating through the document collection, e.g., Aerospace companies in Brazil can be arrived at first by selecting Geography then Line of Business, or vice versa. Filter docs in contention and run text searches scoped to the subset
10 Index support for parametric search Must be able to support queries of the form: find pdf documents that contain stanford university, i.e., a field selection (on doc format) combined with a phrase query. For field selection, use an inverted index of <field value, docids> entries, organized by field name. Use compression etc. as before
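As a sketch of how a field selection combines with a text query, here is a minimal Python illustration; the index contents are hypothetical, and the phrase query is simplified to a conjunction of its terms:

```python
# Hypothetical indexes: a field index (field -> value -> sorted docIDs)
# and an ordinary term index for the document body.
field_index = {"format": {"pdf": [1, 3, 4], "doc": [2, 5]}}
term_index = {"stanford": [1, 2, 4], "university": [1, 4, 5]}

def intersect(a, b):
    """Linear merge of two sorted docID lists, as in Boolean retrieval."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# "Find pdf documents that contain stanford university" (treated as AND):
hits = intersect(intersect(field_index["format"]["pdf"], term_index["stanford"]),
                 term_index["university"])
```

The field posting list participates in the merge exactly like a term posting list, so the usual query optimization heuristics (merge shortest lists first) still apply.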
11 Parametric index support Optionally provide richer search on field values, e.g., wildcards: find books whose Author field contains s*trup. Range search: find docs authored between September and December. An inverted index doesn't work (as well) here; use techniques from database range search, such as B-trees. Use query optimization heuristics as before
12 Zones A zone is an identified region within a doc, e.g., Title, Abstract, Bibliography. Generally culled from marked-up input or document metadata (e.g., powerpoint). Contents of a zone are free text, not a finite vocabulary. Indexes for each zone allow queries like: sorting in Title AND smith in Bibliography AND recur* in Body. Not queries like: all papers whose authors cite themselves. Why?
13 Zone indexes (simple view): separate postings per zone (Title, Author, Body)
14 Zone indexes
15 So we have a database now? Not really. Databases do lots of things we don't need: transactions; recovery (our index is not the system of record; if it breaks, simply reconstruct from the original source). Indeed, we never have to store text in a search engine, only indexes. We're focusing on optimized indexes for text-oriented queries, not an SQL engine.
16 Scoring and Ranking
17 The question How can we select the answers to a search that are most relevant to the information the user is seeking?
18 Scoring Thus far, our queries have all been Boolean: docs either match or not. Good for expert users with a precise understanding of their needs and the corpus; applications can consume 1000s of results. Not good for (the majority of) users with poor Boolean formulation of their needs. Most users don't want to wade through 1000s of results; cf. use of web search engines
19 Scoring We wish to return in order the documents most likely to be useful to the searcher. How can we rank-order the docs in the corpus with respect to a query? Assign a score, say in [0,1], for each doc on each query. Begin with a perfect world: no spammers, nobody stuffing keywords into a doc to make it match queries. More on adversarial IR under web search
20 Weighted zone scoring Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q,d) a score in the interval [0,1]. To this end, it computes a linear combination of zone scores, where each zone of the document contributes a Boolean value to the combination. Assume a set of docs, each with l zones z1,...,zl; matching each zone Di.zj of doc Di against q gives the score Si = g1 x {0,1} + g2 x {0,1} + ... + gl x {0,1}
21 Formal definition Let q be a Boolean query and d a document with l zones. Let g1,...,gl ∈ [0,1], such that Σi=1..l gi = 1. Let si (1 ≤ i ≤ l) be the Boolean score denoting a match (or absence thereof) between query q and the i-th zone of document d: si is a Boolean function that maps the presence of query terms in a zone to 0 or 1. The weighted zone score is defined to be: Σi=1..l gi · si
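As a concrete illustration, here is a minimal sketch of weighted zone scoring in Python; the zone names, the sample document, and the weights are assumptions for illustration, and the match function used here is a simple AND over the query terms:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Weighted zone score: sum of g_i * s_i, where s_i is 1 iff
    every query term appears in zone i, else 0 (Boolean AND match)."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc_zones.get(zone, "").lower().split()
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += g * s
    return score

doc = {"title": "sorting algorithms", "body": "a survey of sorting methods"}
weights = {"title": 0.6, "body": 0.4}  # the g_i, summing to 1
weighted_zone_score(["sorting"], doc, weights)  # matches both zones -> 1.0
```

Because the weights sum to 1 and each s_i is 0 or 1, the result always lies in [0,1], as the definition requires.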
22 Linear zone combinations First generation of scoring methods: use a linear combination of Booleans, e.g., Score = 0.6*<sorting in Title> + 0.3*<sorting in Abstract> + ...*<sorting in Body> + ...*<sorting in Boldface>. Each expression such as <sorting in Title> takes on a value in {0,1}, so the overall score is in [0,1]. For this example the scores can only take on a finite set of values; what are they?
23 Linear zone combinations In fact, the expressions between <> on the last slide could be any Boolean query. Who generates the Score expression (with weights such as 0.6 etc.)? In uncommon cases the user, in the UI; most commonly, a query parser that takes the user's Boolean query and runs it on the indexes for each zone
24 Exercise On the query bill OR rights, suppose that we retrieve postings for bill and rights from each of the Author, Title, and Body zone indexes (the postings table is not preserved in this transcription). Compute the score for each doc based on the weightings 0.6, 0.3, 0.1
25 General idea We are given a weight vector whose components sum up to 1; there is a weight for each zone/field. Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields. Typically users want to see the K highest-scoring docs.
26 Index support for zone combinations In the simplest version we have a separate inverted index for each zone Variant: have a single index with a separate dictionary entry for each term and zone E.g., bill.author bill.title bill.body Of course, compress zone names like author/title/body.
27 Zone combinations index The above scheme is still wasteful: each term is potentially replicated for each zone In a slightly better scheme, we encode the zone in the postings: bill 1.author, 1.body 2.author, 2.body 3.title As before, the zone names get compressed. At query time, accumulate contributions to the total score of a document from the various postings, e.g.,
28 Score accumulation bill 1.author, 1.body 2.author, 2.body 3.title rights 3.title, 3.body 5.title, 5.body As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before. Note: we get both bill and rights in the Title field of doc 3, but score it no higher. Should we give more weight to more hits?
29 Algorithm for computing weighted score Compute weighted scores directly from inverted indexes, assuming query q is a 2-term query consisting of query terms q1 and q2: compute Σi=1..l gi · si
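The score accumulation above can be sketched in Python. The postings mirror the bill OR rights example from the preceding slides, while the zone weights are assumed for illustration:

```python
# Zone-encoded postings as in the slides: term -> list of (docID, zone) pairs.
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}
g = {"author": 0.2, "title": 0.5, "body": 0.3}  # assumed weights, summing to 1

def weighted_zone_scores(query_terms):
    """Walk the postings of each query term, letting each (doc, zone) pair
    contribute its weight g_i at most once -- so matching both 'bill' and
    'rights' in the Title of doc 3 scores it no higher, as the slides note."""
    seen, scores = set(), {}
    for term in query_terms:
        for doc_id, zone in postings.get(term, []):
            if (doc_id, zone) not in seen:
                seen.add((doc_id, zone))
                scores[doc_id] = scores.get(doc_id, 0.0) + g[zone]
    return scores
```

With these weights, docs 3 and 5 (title and body hits) come out ahead of docs 1 and 2 (author and body hits), because the title zone carries the largest weight.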
30 Learning weights
31 Where do these weights come from? Machine-learned relevance: a general class of approaches to scoring and ranking in information retrieval. Given a test corpus, a suite of test queries, and a set of relevance judgments (in its simplest form, a relevance judgment is either Relevant or Non-relevant), learn a set of weights such that the relevance judgments are matched. Can be formulated as ordinal regression
32 Learning for weighted zone scoring Problem formulation: learning a linear function of the Boolean match scores contributed by the various zones. Expensive component: labor-intensive assembly of user-generated relevance judgments from which to learn the weights. The learning problem can be reduced to a simple optimization problem
33 Learning weights Simple case of weighted zone scoring: each document has a title zone and a body zone. Given a query q and a document d, use the Boolean match function to compute st(d,q) and sb(d,q), depending on whether the title/body zone of d matches q, and then: score(d,q) = g · st(d,q) + (1-g) · sb(d,q). g is the parameter that we want to estimate, using a set of training examples Φj = (dj, qj, r(dj,qj)), where r(dj,qj) is the relevance judgment from a human editor
34 Training examples We are given training examples, each of which is a triple: DocID d, Query q, and Judgment Relevant/Non-relevant. From these, we will learn the best value of g.
35 Learning weights How do we estimate g? Given some g, the score for each Φj is: score(dj,qj) = g · st(dj,qj) + (1-g) · sb(dj,qj). We want to find the value of g for which Σj ε(g, Φj) is minimum, where: ε(g, Φj) = (r(dj,qj) - score(dj,qj))²
36 Learning weights Let n01r / n01n denote the number of training examples for which st(dj,qj)=0, sb(dj,qj)=1, and the editorial judgment is Relevant / Non-relevant. For such examples the score is 1-g, so the contribution to the total error from training examples with [st=0, sb=1] is: n01r · g² + n01n · (1-g)². Taking into account all combinations of values, we get a total error of: (n01r + n10n) g² + (n10r + n01n) (1-g)² + n00r + n11n
37 Learning weights This is a quadratic function f(g). How do we find its minimum? Solve df(g)/dg = 0, which yields g = (n10r + n01n) / (n10r + n10n + n01r + n01n)
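Under this squared-error formulation the minimizer has a closed form; a small sketch, where the counts follow the [st, sb] match-pattern-and-judgment naming convention of the slides:

```python
def best_g(n10r, n10n, n01r, n01n):
    """Closed-form minimizer of the quadratic total error:
    g* = (n10r + n01n) / (n10r + n10n + n01r + n01n).
    n10r/n10n count examples with st=1, sb=0 judged Relevant/Non-relevant;
    n01r/n01n count examples with st=0, sb=1 judged Relevant/Non-relevant."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)
```

Intuitively, g grows when title-only matches tend to be relevant (n10r) and body-only matches tend not to be (n01n); for example, if all title-only matches are relevant and all body-only matches are non-relevant, g comes out as 1.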
38 Scoring: density-based Thus far: position and overlap of terms in a document (title, author, etc.). Obvious next idea: if a document talks about a topic more, then it is a better match. This applies even when we only have a single query term. Document relevant if it has many occurrences of the term(s). This leads to the idea of term weighting.
39 Term frequency and weighting
40 Weighted Zone Scoring for Boolean Queries Let q be a Boolean query and d a document with l zones. si (1 ≤ i ≤ l) is a Boolean value denoting a match between q and the i-th zone of d. The weighted zone score is defined to be: score(q,d) = Σi=1..l gi · si, where g1,...,gl ∈ [0,1] and Σi gi = 1
41 Term Frequency A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. Each term in a document can be assigned a weight that depends on the number of occurrences of the term in the document (term frequency, tft,d). WARNING: frequency is used to mean count. Thus term frequency in IR literature means the number of occurrences in a doc, not divided by document length (which would actually make it a frequency)
42 Bag of words Documents can be modeled as a bag of words: the exact ordering of words is ignored; we retain only information on the number of occurrences of each term; we assume that two documents with similar bag-of-words representations are similar in content. Thus the doc John is quicker than Mary. is indistinguishable from the doc Mary is quicker than John. Which of the indexes discussed so far distinguish these two docs?
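A minimal sketch of the bag-of-words model in Python, using the John/Mary example from the slide:

```python
from collections import Counter

def bag_of_words(doc):
    """Bag of words: discard word order, keep only term counts."""
    return Counter(doc.lower().strip(".").split())

d1 = bag_of_words("John is quicker than Mary.")
d2 = bag_of_words("Mary is quicker than John.")
d1 == d2  # the two docs are indistinguishable under this model
```

A positional index, by contrast, would distinguish the two documents, since it records where each term occurs.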
43 Which words count? For a free text query q = (q1,...,ql), all terms in q are equally important when assessing the relevance of q to d. However, certain terms have little or no discriminating power in determining relevance: they may appear in almost all documents of a collection
44 Which words count? Consider the ides of march query on Shakespeare's works. Julius Caesar has 5 occurrences of ides; no other play has ides. march occurs in over a dozen plays. All the plays contain of. By this scoring measure, the top-scoring play is likely to be the one with the most ofs
45 Term frequency tf Long docs are favored because they're more likely to contain query terms. Can fix this to some extent by normalizing for document length. But is raw tf the right measure?
46 Weighting term frequency: tf What is the relative importance of 0 vs. 1 occurrence of a term in a doc? 1 vs. 2 occurrences? 2 vs. 3 occurrences? Unclear: while it seems that more is better, a lot isn't proportionally better than a few. Can just use raw tf. Another option commonly used in practice is sublinear scaling: wf = 1 + log tf if tf > 0, and 0 otherwise
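A commonly used option is sublinear tf scaling; a minimal sketch (log base 10 assumed here, the base being a convention):

```python
import math

def wf(tf):
    """Sublinear tf scaling: wf = 1 + log(tf) if tf > 0, else 0,
    so 10 occurrences count only modestly more than 1."""
    return 1 + math.log10(tf) if tf > 0 else 0.0
```

Under this scheme, 1, 10, and 100 occurrences yield weights 1, 2, and 3, capturing the intuition that more is better but not proportionally so.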
47 Score computation Score for a query q: Score(q,d) = Σt∈q tft,d [Note: 0 if no query terms in document]. This score can be zone-combined. Can use wf instead of tf in the above. Still doesn't consider term scarcity in the collection (ides is rarer than of)
48 Weighting should depend on the term overall Which of these tells you more about a doc? 10 occurrences of hernia? 10 occurrences of the? Would like to attenuate the weight of a common term But what is common?
49 Metrics of common Collection frequency (cf) = total number of occurrences of the term in the entire collection of documents. Document frequency (df) = number of docs in the corpus containing the term. df is preferable over cf (the table of cf/df values for ferrari and insurance is not preserved): the few docs that contain ferrari should get a higher boost for a query on ferrari than the docs that contain insurance get from a query on insurance
50 Inverse Document Frequency Inverse document frequency (idf): a measure of the informativeness of a term, i.e., its rarity across the whole corpus. Could just use the raw count of the number of documents the term occurs in (idfi = 1/dfi), but by far the most commonly used version is: idft = log(N/dft), where N is the number of documents in the collection
51 tf x idf term weights Combine term frequency (or wf, or some measure of term density) and inverse document frequency to produce a composite weight for each term in each document: tf-idft,d = tft,d x idft. tf-idf is: highest when t occurs many times within a small number of documents (lending high discriminating power to those docs); lower when the term occurs fewer times in a doc, or occurs in many documents (thus offering a less pronounced relevance signal); lowest when the term occurs in virtually all documents
52 Summary: tf x idf (or tf.idf) Assign a tf.idf weight to each term i in each document d What is the wt of a term that occurs in all of the docs? Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus
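A minimal sketch of the composite weight (log base 10 assumed for idf). Note that a term occurring in all N documents gets idf = log(N/N) = 0, and hence weight 0, which answers the question above:

```python
import math

def idf(N, df):
    """Inverse document frequency: idf_t = log(N / df_t)."""
    return math.log10(N / df)

def tf_idf(tf, N, df):
    """Composite weight: tf-idf_{t,d} = tf_{t,d} * idf_t."""
    return tf * idf(N, df)
```

The weight grows with the count within a doc (via tf) and with the rarity of the term across the corpus (via idf), matching the two properties listed on the slide.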
53 Real-valued term vectors Still a bag-of-words model. Each document is a vector in R^M (the example matrix of log-scaled tf-idf weights is not preserved)
54 Exercise Consider the following table of term frequencies for Doc1, Doc2, and Doc3 (the tf table is not preserved in this transcription). Compute the tf-idf weights of the terms car, auto, insurance, and best for each document, using the idf values given in the second table
55 tf-idf values for Doc1, Doc2, Doc3 and the terms car, auto, insurance, best (table values not preserved)
56 Vector Space Model Each doc d can now be viewed as a vector of wf·idf values, one component for each term. So we have a vector space: terms are axes; docs live in this space; even with stemming, it may have 20,000+ dimensions. We denote by V(d) the vector derived from document d, with one component for each dictionary term; components are computed using the tf-idf weighting scheme
57 Why turn docs into vectors? First application: query by example. Given a doc D, find others like it. With D expressed as a vector, we need to find other vectors (docs) near it. Assumption: documents that are close together in the vector space talk about the same things
58 Similarity metrics How do we quantify the similarity between two documents in the Vector Space model? Taking the difference between the two vectors is not a solution (why?) (diagram of vectors A, B, and A-B not preserved)
59 Cosine similarity Distance between vectors V(d1) and V(d2) is captured by the cosine of the angle θ between them. Note this is a similarity, not a distance: there is no triangle inequality for similarity.
60 Cosine similarity Mathematical definition: sim(d1,d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|), the cosine of the angle between the two vectors. Numerator: inner product of the two vectors. Denominator: product of their Euclidean lengths. The effect of the denominator is to length-normalize the two vectors: v(d1) = V(d1)/|V(d1)| is a unit vector. Longer documents don't get more weight
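The definition translates directly into code; a minimal sketch over plain lists of component weights:

```python
import math

def cosine_sim(v1, v2):
    """sim(d1, d2) = V(d1) . V(d2) / (|V(d1)| |V(d2)|): the cosine of the
    angle between the two vectors, normalizing away document length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))
```

Parallel vectors score 1, orthogonal vectors score 0, and scaling a vector (a longer document with the same term proportions) leaves the score unchanged.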
61 Example Docs: Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH). (The term-frequency table and the term vectors, computed from raw tf only and normalized over only 3 terms, are not preserved.) The computed values cos(SaS, PaP) and cos(SaS, WH) show SaS closer to PaP than to WH
62 Using the cosine similarity When searching for the documents in a collection that are most similar to a document d, we reduce this problem to that of finding the documents di with the highest inner product v(di) · v(d)
63 Term-document matrix Viewing a collection of N documents as a collection of vectors leads to a view of the collection as an MxN matrix: M rows represent the M terms; N columns represent the N documents. Each document is a vector in R^M (example matrix of log-scaled tf-idf weights not preserved)
64 Queries as vectors Given a free text query q: view this as a bag of words, treating it as a very short document; compute the query vector V(q). Answers to this query are documents of the collection that have a high cosine similarity with V(q). In other words, we score documents as follows: score(q,d) = V(q) · V(d) / (|V(q)| |V(d)|)
65 Computing vector scores Collection of documents, each represented by a vector. Free text query, represented by a vector. Positive integer K. Seek the K docs of the collection with the highest vector space scores on the given query. Basic algorithm follows: Length[N]: array holding the lengths (normalization factors) for each of the N documents; Scores[N]: scores for each document
66 Basic cosine score computation Calculate weight for query term t Where is this stored? Update score of all documents d that contain t, by adding its contribution to sim(q,d) Normalize
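The algorithm above can be sketched term-at-a-time in Python; the postings layout (term mapped to (docID, stored weight) pairs) and the toy values are assumptions for illustration:

```python
from collections import defaultdict

def top_k_cosine(query_weights, postings, lengths, K):
    """Term-at-a-time scoring: for each query term t, add w(t,q) * w(t,d)
    to doc d's score accumulator, then divide by the precomputed document
    length (the normalization factor) and return the K highest scorers."""
    scores = defaultdict(float)
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] += w_q * w_d
    ranked = sorted(scores.items(), key=lambda kv: kv[1] / lengths[kv[0]],
                    reverse=True)
    return [(d, s / lengths[d]) for d, s in ranked[:K]]

# Toy data: term -> [(docID, stored tf-idf weight)], Length[] per doc.
postings = {"ides": [(1, 3.0)], "march": [(1, 2.0), (2, 1.0)]}
lengths = {1: 2.0, 2: 1.0}
top = top_k_cosine({"ides": 1.0, "march": 1.0}, postings, lengths, K=2)
```

Storing the lengths once per document, rather than normalized floats in every postings entry, is exactly the space trade-off discussed on the next slide.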
67 Notes on score computation Why not store the normalized tf-idf value at each postings entry? idf is the same for all postings entries in a single postings list (term). What about tf/normalization? Space blowup because of floats: tf values are stored as integers; lengths are stored once per doc
68 Document & query weighting schemes Several variations of vector space scoring exist. The SMART notation represents the different variations in the form ddd.qqq: ddd encodes the term weighting of the doc vector; qqq encodes the weighting in the query vector (term frequency, document frequency, normalization)
69 Back-up slides
70 Full text queries We just scored the Boolean query bill OR rights Most users more likely to type bill rights or bill of rights How do we interpret these full text queries? No Boolean connectives Of several query terms some may be missing in a doc Only some query terms may occur in the title, etc.
71 Full text queries To use zone combinations for free text queries, we need a way of assigning a score to a pair <free text query, zone>. Zero query terms in the zone should mean a zero score; more query terms in the zone should mean a higher score. Scores don't have to be Boolean. Will look at some alternatives now
72 Incidence matrices Recall: a document (or a zone in it) is a binary vector X in {0,1}^V; the query is also such a vector Y. Score: overlap measure |X ∩ Y|, the number of query terms that appear in the document
73 Example On the query ides of march, Shakespeare s Julius Caesar has a score of 3 All other Shakespeare plays have a score of 2 (because they contain march) or 1 Thus in a rank order, Julius Caesar would come out tops
74 Overlap matching What's wrong with the overlap measure? It doesn't consider: term frequency in the document; term scarcity in the collection (document mention frequency: of is more common than ides or march); length of documents (and queries: score not normalized)
75 Overlap matching One can normalize in various ways. Jaccard coefficient: |X ∩ Y| / |X ∪ Y|. Cosine measure: |X ∩ Y| / sqrt(|X| · |Y|). What documents would score best using Jaccard against a typical query? Does the cosine measure fix this problem?
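With documents and queries as term sets (binary vectors), these two normalizations can be sketched as follows; the formulas for binary vectors are as reconstructed above:

```python
import math

def jaccard(x, y):
    """Jaccard coefficient: |X intersect Y| / |X union Y|."""
    return len(x & y) / len(x | y)

def binary_cosine(x, y):
    """Cosine for binary vectors: |X intersect Y| / sqrt(|X| * |Y|),
    which penalizes long documents less harshly than Jaccard."""
    return len(x & y) / math.sqrt(len(x) * len(y))
```

Under Jaccard, short documents that consist almost entirely of the query terms score best; the square root in the cosine denominator softens, but does not fully remove, that bias.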
76 Scoring: density-based Thus far: position and overlap of terms in a doc (title, author, etc.). Obvious next idea: if a document talks about a topic more, then it is a better match. This applies even when we only have a single query term. Document relevant if it has a lot of the terms. This leads to the idea of term weighting.
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationClassic IR Models 5/6/2012 1
Classic IR Models 5/6/2012 1 Classic IR Models Idea Each document is represented by index terms. An index term is basically a (word) whose semantics give meaning to the document. Not all index terms are
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationThe Web document collection
Web Data Management Part 1 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University
More informationFoundations of Computing
Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More information3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationPart 2: Boolean Retrieval Francesco Ricci
Part 2: Boolean Retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content p Term document matrix p Information
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationInformation Retrieval
Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval Outline ❶ Course details ❷ Information retrieval ❸ Boolean retrieval 2 Course details
More informationVector Space Models. Jesse Anderton
Vector Space Models Jesse Anderton CS6200: Information Retrieval In the first module, we introduced Vector Space Models as an alternative to Boolean Retrieval. This module discusses VSMs in a lot more
More informationSearch Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]
Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationSession 10: Information Retrieval
INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 2: Boolean retrieval 2 Blanks on slides, you may want to fill in Last Time: Ngram Language Models Unigram LM: Bag of words Ngram
More informationMore on indexing CE-324: Modern Information Retrieval Sharif University of Technology
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationIntroduction to Computational Advertising. MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski
Introduction to Computational Advertising MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski 1 Lecture 4: Sponsored Search (part 2) 2 Disclaimers This talk presents
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationNotes: Notes: Primo Ranking Customization
Primo Ranking Customization Hello, and welcome to today s lesson entitled Ranking Customization in Primo. Like most search engines, Primo aims to present results in descending order of relevance, with
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationRelevance of a Document to a Query
Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document
More informationQuerying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze
Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Boolean Retrieval Weighted Boolean Retrieval Zone Indices
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationText Retrieval an introduction
Text Retrieval an introduction Michalis Vazirgiannis Nov. 2012 Outline Document collection preprocessing Feature Selection Indexing Query processing & Ranking Text representation for Information Retrieval
More informationNear Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri
Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions
More informationBalancing Manual and Automatic Indexing for Retrieval of Paper Abstracts
Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More information