Scoring, Term Weighting, and the Vector Space Model
- Oswin McDowell
1 Scoring, Term Weighting, and the Vector Space Model
2 Why ranking? So far we have seen indexes that support Boolean search: a document either matches a given query or it does not. On large collections, Boolean search can return an enormous number of answers. It is therefore important to be able to rank the answers, assigning a score to each (query, document) pair
3 This lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring Term weighting Vector space retrieval
4 Metadata Digital documents usually contain certain metadata associated with each document: data about a document, such as author, title, date of publication, abstract, etc. Fields: metadata entities whose possible values are finite, e.g., the set of all dates of authorship, the set of all names, etc. Zones: entities whose values are typically an arbitrary, unbounded amount of text, e.g., document titles and abstracts
5 Metadata (ctd.) Fields and their values, e.g., Language = French, Format = pdf, Subject = Physics, Date = Feb 2000
6 Parametric search A parametric search interface allows the user to combine a full-text query with selections on field values e.g., language, date range, etc.
7 Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
8 Parametric search example We can add text search.
9 Parametric/field search In these examples, we select field values. Values can be hierarchical, e.g., Geography: Continent / Country / State / City. A paradigm for navigating through the document collection, e.g., Aerospace companies in Brazil can be arrived at first by selecting Geography then Line of Business, or vice versa. Filter docs in contention and run text searches scoped to the subset
10 Index support for parametric search Must be able to support queries of the form: find pdf documents that contain stanford university, i.e., a field selection (on doc format) combined with a phrase query. For field selection, use an inverted index of <field value, docids> entries, organized by field name. Use compression etc. as before
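As a sketch of how a field selection combines with a text query, here is a minimal Python illustration; the index contents are hypothetical, and the phrase query is simplified to a conjunction of its terms:

```python
# Hypothetical indexes: a field index (field -> value -> sorted docIDs)
# and an ordinary term index for the document body.
field_index = {"format": {"pdf": [1, 3, 4], "doc": [2, 5]}}
term_index = {"stanford": [1, 2, 4], "university": [1, 4, 5]}

def intersect(a, b):
    """Linear merge of two sorted docID lists, as in Boolean retrieval."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# "Find pdf documents that contain stanford university" (treated as AND):
hits = intersect(intersect(field_index["format"]["pdf"], term_index["stanford"]),
                 term_index["university"])
```

The field posting list participates in the merge exactly like a term posting list, so the usual query optimization heuristics (merge shortest lists first) still apply.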
11 Parametric index support Optionally provide richer search on field values, e.g., wildcards: find books whose Author field contains s*trup. Range search: find docs authored between September and December. An inverted index doesn't work (as well) here; use techniques from database range search, such as B-trees. Use query optimization heuristics as before
12 Zones A zone is an identified region within a doc, e.g., Title, Abstract, Bibliography. Generally culled from marked-up input or document metadata (e.g., powerpoint). Contents of a zone are free text, not a finite vocabulary. Indexes for each zone allow queries like: sorting in Title AND smith in Bibliography AND recur* in Body. Not queries like: all papers whose authors cite themselves. Why?
13 Zone indexes (simple view): separate postings per zone (Title, Author, Body)
14 Zone indexes
15 So we have a database now? Not really. Databases do lots of things we don't need: transactions; recovery (our index is not the system of record; if it breaks, simply reconstruct from the original source). Indeed, we never have to store text in a search engine, only indexes. We're focusing on optimized indexes for text-oriented queries, not an SQL engine.
16 Scoring and Ranking
17 The question How can we select the answers to a search that are most relevant to the information the user is seeking?
18 Scoring Thus far, our queries have all been Boolean: docs either match or not. Good for expert users with a precise understanding of their needs and the corpus; applications can consume 1000s of results. Not good for (the majority of) users with poor Boolean formulation of their needs. Most users don't want to wade through 1000s of results; cf. use of web search engines
19 Scoring We wish to return in order the documents most likely to be useful to the searcher. How can we rank-order the docs in the corpus with respect to a query? Assign a score, say in [0,1], for each doc on each query. Begin with a perfect world: no spammers, nobody stuffing keywords into a doc to make it match queries. More on adversarial IR under web search
20 Weighted zone scoring Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q,d) a score in the interval [0,1]. To this end, it computes a linear combination of zone scores, where each zone of the document contributes a Boolean value to the combination. Assume a set of docs, each with l zones z1,...,zl; matching each zone Di.zj of doc Di against q gives the score Si = g1 x {0,1} + g2 x {0,1} + ... + gl x {0,1}
21 Formal definition Let q be a Boolean query and d a document with l zones. Let g1,...,gl ∈ [0,1], such that Σi=1..l gi = 1. Let si (1 ≤ i ≤ l) be the Boolean score denoting a match (or absence thereof) between query q and the i-th zone of document d: si is a Boolean function that maps the presence of query terms in a zone to 0 or 1. The weighted zone score is defined to be: Σi=1..l gi · si
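As a concrete illustration, here is a minimal sketch of weighted zone scoring in Python; the zone names, the sample document, and the weights are assumptions for illustration, and the match function used here is a simple AND over the query terms:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Weighted zone score: sum of g_i * s_i, where s_i is 1 iff
    every query term appears in zone i, else 0 (Boolean AND match)."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc_zones.get(zone, "").lower().split()
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += g * s
    return score

doc = {"title": "sorting algorithms", "body": "a survey of sorting methods"}
weights = {"title": 0.6, "body": 0.4}  # the g_i, summing to 1
weighted_zone_score(["sorting"], doc, weights)  # matches both zones -> 1.0
```

Because the weights sum to 1 and each s_i is 0 or 1, the result always lies in [0,1], as the definition requires.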
22 Linear zone combinations First generation of scoring methods: use a linear combination of Booleans, e.g., Score = 0.6*<sorting in Title> + 0.3*<sorting in Abstract> + ...*<sorting in Body> + ...*<sorting in Boldface>. Each expression such as <sorting in Title> takes on a value in {0,1}, so the overall score is in [0,1]. For this example the scores can only take on a finite set of values; what are they?
23 Linear zone combinations In fact, the expressions between <> on the last slide could be any Boolean query. Who generates the Score expression (with weights such as 0.6 etc.)? In uncommon cases the user, in the UI; most commonly, a query parser that takes the user's Boolean query and runs it on the indexes for each zone
24 Exercise On the query bill OR rights, suppose that we retrieve postings for bill and rights from each of the Author, Title, and Body zone indexes (the postings table is not preserved in this transcription). Compute the score for each doc based on the weightings 0.6, 0.3, 0.1
25 General idea We are given a weight vector whose components sum up to 1; there is a weight for each zone/field. Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields. Typically users want to see the K highest-scoring docs.
26 Index support for zone combinations In the simplest version we have a separate inverted index for each zone Variant: have a single index with a separate dictionary entry for each term and zone E.g., bill.author bill.title bill.body Of course, compress zone names like author/title/body.
27 Zone combinations index The above scheme is still wasteful: each term is potentially replicated for each zone In a slightly better scheme, we encode the zone in the postings: bill 1.author, 1.body 2.author, 2.body 3.title As before, the zone names get compressed. At query time, accumulate contributions to the total score of a document from the various postings, e.g.,
28 Score accumulation bill 1.author, 1.body 2.author, 2.body 3.title rights 3.title, 3.body 5.title, 5.body As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before. Note: we get both bill and rights in the Title field of doc 3, but score it no higher. Should we give more weight to more hits?
29 Algorithm for computing weighted score Compute weighted scores directly from inverted indexes, assuming query q is a 2-term query consisting of query terms q1 and q2: compute Σi=1..l gi · si
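The score accumulation above can be sketched in Python. The postings mirror the bill OR rights example from the preceding slides, while the zone weights are assumed for illustration:

```python
# Zone-encoded postings as in the slides: term -> list of (docID, zone) pairs.
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}
g = {"author": 0.2, "title": 0.5, "body": 0.3}  # assumed weights, summing to 1

def weighted_zone_scores(query_terms):
    """Walk the postings of each query term, letting each (doc, zone) pair
    contribute its weight g_i at most once -- so matching both 'bill' and
    'rights' in the Title of doc 3 scores it no higher, as the slides note."""
    seen, scores = set(), {}
    for term in query_terms:
        for doc_id, zone in postings.get(term, []):
            if (doc_id, zone) not in seen:
                seen.add((doc_id, zone))
                scores[doc_id] = scores.get(doc_id, 0.0) + g[zone]
    return scores
```

With these weights, docs 3 and 5 (title and body hits) come out ahead of docs 1 and 2 (author and body hits), because the title zone carries the largest weight.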
30 Learning weights
31 Where do these weights come from? Machine-learned relevance: a general class of approaches to scoring and ranking in information retrieval. Given a test corpus, a suite of test queries, and a set of relevance judgments (in its simplest form, a relevance judgment is either Relevant or Non-relevant), learn a set of weights such that the relevance judgments are matched. Can be formulated as ordinal regression
32 Learning for weighted zone scoring Problem formulation: learning a linear function of the Boolean match scores contributed by the various zones. Expensive component: labor-intensive assembly of user-generated relevance judgments from which to learn the weights. The learning problem can be reduced to a simple optimization problem
33 Learning weights Simple case of weighted zone scoring: each document has a title zone and a body zone. Given a query q and a document d, use the Boolean match function to compute st(d,q) and sb(d,q), depending on whether the title/body zone of d matches q, and then: score(d,q) = g · st(d,q) + (1-g) · sb(d,q). g is the parameter that we want to estimate, using a set of training examples Φj = (dj, qj, r(dj,qj)), where r(dj,qj) is the relevance judgment from a human editor
34 Training examples We are given training examples, each of which is a triple: DocID d, Query q, and Judgment Relevant/Non-relevant. From these, we will learn the best value of g.
35 Learning weights How do we estimate g? Given some g, the score for each Φj is: score(dj,qj) = g · st(dj,qj) + (1-g) · sb(dj,qj). We want to find the value of g for which Σj ε(g, Φj) is minimum, where: ε(g, Φj) = (r(dj,qj) - score(dj,qj))²
36 Learning weights Let n01r / n01n denote the number of training examples for which st(dj,qj)=0, sb(dj,qj)=1, and the editorial judgment is Relevant / Non-relevant. For such examples the score is 1-g, so the contribution to the total error from training examples with [st=0, sb=1] is: n01r · g² + n01n · (1-g)². Taking into account all combinations of values, we get a total error of: (n01r + n10n) g² + (n10r + n01n) (1-g)² + n00r + n11n
37 Learning weights This is a quadratic function f(g). How do we find its minimum? Solve df(g)/dg = 0, which yields g = (n10r + n01n) / (n10r + n10n + n01r + n01n)
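Under this squared-error formulation the minimizer has a closed form; a small sketch, where the counts follow the [st, sb] match-pattern-and-judgment naming convention of the slides:

```python
def best_g(n10r, n10n, n01r, n01n):
    """Closed-form minimizer of the quadratic total error:
    g* = (n10r + n01n) / (n10r + n10n + n01r + n01n).
    n10r/n10n count examples with st=1, sb=0 judged Relevant/Non-relevant;
    n01r/n01n count examples with st=0, sb=1 judged Relevant/Non-relevant."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)
```

Intuitively, g grows when title-only matches tend to be relevant (n10r) and body-only matches tend not to be (n01n); for example, if all title-only matches are relevant and all body-only matches are non-relevant, g comes out as 1.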
38 Scoring: density-based Thus far: position and overlap of terms in a document (title, author, etc.). Obvious next idea: if a document talks about a topic more, then it is a better match. This applies even when we only have a single query term. Document relevant if it has many occurrences of the term(s). This leads to the idea of term weighting.
39 Term frequency and weighting
40 Weighted Zone Scoring for Boolean Queries Let q be a Boolean query and d a document with l zones. si (1 ≤ i ≤ l) is a Boolean value denoting a match between q and the i-th zone of d. The weighted zone score is defined to be: score(q,d) = Σi=1..l gi · si, where g1,...,gl ∈ [0,1] and Σi gi = 1
41 Term Frequency A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. Each term in a document can be assigned a weight that depends on the number of occurrences of the term in the document (term frequency, tft,d). WARNING: frequency is used to mean count. Thus term frequency in IR literature means the number of occurrences in a doc, not divided by document length (which would actually make it a frequency)
42 Bag of words Documents can be modeled as a bag of words: the exact ordering of words is ignored; we retain only information on the number of occurrences of each term; we assume that two documents with similar bag-of-words representations are similar in content. Thus the doc John is quicker than Mary. is indistinguishable from the doc Mary is quicker than John. Which of the indexes discussed so far distinguish these two docs?
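A minimal sketch of the bag-of-words model in Python, using the John/Mary example from the slide:

```python
from collections import Counter

def bag_of_words(doc):
    """Bag of words: discard word order, keep only term counts."""
    return Counter(doc.lower().strip(".").split())

d1 = bag_of_words("John is quicker than Mary.")
d2 = bag_of_words("Mary is quicker than John.")
d1 == d2  # the two docs are indistinguishable under this model
```

A positional index, by contrast, would distinguish the two documents, since it records where each term occurs.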
43 Which words count? For a free text query q = (q1,...,ql), all terms in q are equally important when assessing the relevance of q to d. However, certain terms have little or no discriminating power in determining relevance: they may appear in almost all documents of a collection
44 Which words count? Consider the ides of march query on Shakespeare's works. Julius Caesar has 5 occurrences of ides; no other play has ides. march occurs in over a dozen plays. All the plays contain of. By this scoring measure, the top-scoring play is likely to be the one with the most ofs
45 Term frequency tf Long docs are favored because they're more likely to contain query terms. Can fix this to some extent by normalizing for document length. But is raw tf the right measure?
46 Weighting term frequency: tf What is the relative importance of 0 vs. 1 occurrence of a term in a doc? 1 vs. 2 occurrences? 2 vs. 3 occurrences? Unclear: while it seems that more is better, a lot isn't proportionally better than a few. Can just use raw tf. Another option commonly used in practice is sublinear scaling: wf = 1 + log tf if tf > 0, and 0 otherwise
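A commonly used option is sublinear tf scaling; a minimal sketch (log base 10 assumed here, the base being a convention):

```python
import math

def wf(tf):
    """Sublinear tf scaling: wf = 1 + log(tf) if tf > 0, else 0,
    so 10 occurrences count only modestly more than 1."""
    return 1 + math.log10(tf) if tf > 0 else 0.0
```

Under this scheme, 1, 10, and 100 occurrences yield weights 1, 2, and 3, capturing the intuition that more is better but not proportionally so.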
47 Score computation Score for a query q: Score(q,d) = Σt∈q tft,d [Note: 0 if no query terms in document]. This score can be zone-combined. Can use wf instead of tf in the above. Still doesn't consider term scarcity in the collection (ides is rarer than of)
48 Weighting should depend on the term overall Which of these tells you more about a doc? 10 occurrences of hernia? 10 occurrences of the? Would like to attenuate the weight of a common term But what is common?
49 Metrics of common Collection frequency (cf) = total number of occurrences of the term in the entire collection of documents. Document frequency (df) = number of docs in the corpus containing the term. df is preferable over cf (the table of cf/df values for ferrari and insurance is not preserved): the few docs that contain ferrari should get a higher boost for a query on ferrari than the docs that contain insurance get from a query on insurance
50 Inverse Document Frequency Inverse document frequency (idf): a measure of the informativeness of a term, i.e., its rarity across the whole corpus. Could just use the raw count of the number of documents the term occurs in (idfi = 1/dfi), but by far the most commonly used version is: idft = log(N/dft), where N is the number of documents in the collection
51 tf x idf term weights Combine term frequency (or wf, or some measure of term density) and inverse document frequency to produce a composite weight for each term in each document: tf-idft,d = tft,d x idft. tf-idf is: highest when t occurs many times within a small number of documents (lending high discriminating power to those docs); lower when the term occurs fewer times in a doc, or occurs in many documents (thus offering a less pronounced relevance signal); lowest when the term occurs in virtually all documents
52 Summary: tf x idf (or tf.idf) Assign a tf.idf weight to each term i in each document d What is the wt of a term that occurs in all of the docs? Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus
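A minimal sketch of the composite weight (log base 10 assumed for idf). Note that a term occurring in all N documents gets idf = log(N/N) = 0, and hence weight 0, which answers the question above:

```python
import math

def idf(N, df):
    """Inverse document frequency: idf_t = log(N / df_t)."""
    return math.log10(N / df)

def tf_idf(tf, N, df):
    """Composite weight: tf-idf_{t,d} = tf_{t,d} * idf_t."""
    return tf * idf(N, df)
```

The weight grows with the count within a doc (via tf) and with the rarity of the term across the corpus (via idf), matching the two properties listed on the slide.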
53 Real-valued term vectors Still a bag-of-words model. Each document is a vector in R^M (the example matrix of log-scaled tf-idf weights is not preserved)
54 Exercise Consider the following table of term frequencies for Doc1, Doc2, and Doc3 (the tf table is not preserved in this transcription). Compute the tf-idf weights of the terms car, auto, insurance, and best for each document, using the idf values given in the second table
55 tf-idf values for Doc1, Doc2, Doc3 and the terms car, auto, insurance, best (table values not preserved)
56 Vector Space Model Each doc d can now be viewed as a vector of wf·idf values, one component for each term. So we have a vector space: terms are axes; docs live in this space; even with stemming, it may have 20,000+ dimensions. We denote by V(d) the vector derived from document d, with one component for each dictionary term; components are computed using the tf-idf weighting scheme
57 Why turn docs into vectors? First application: query by example. Given a doc D, find others like it. With D expressed as a vector, we need to find other vectors (docs) near it. Assumption: documents that are close together in the vector space talk about the same things
58 Similarity metrics How do we quantify the similarity between two documents in the Vector Space model? Taking the difference between the two vectors is not a solution (why?) (diagram of vectors A, B, and A-B not preserved)
59 Cosine similarity Distance between vectors V(d1) and V(d2) is captured by the cosine of the angle θ between them. Note this is a similarity, not a distance: there is no triangle inequality for similarity.
60 Cosine similarity Mathematical definition: sim(d1,d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|), the cosine of the angle between the two vectors. Numerator: inner product of the two vectors. Denominator: product of their Euclidean lengths. The effect of the denominator is to length-normalize the two vectors: v(d1) = V(d1)/|V(d1)| is a unit vector. Longer documents don't get more weight
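The definition translates directly into code; a minimal sketch over plain lists of component weights:

```python
import math

def cosine_sim(v1, v2):
    """sim(d1, d2) = V(d1) . V(d2) / (|V(d1)| |V(d2)|): the cosine of the
    angle between the two vectors, normalizing away document length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))
```

Parallel vectors score 1, orthogonal vectors score 0, and scaling a vector (a longer document with the same term proportions) leaves the score unchanged.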
61 Example Docs: Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH). (The term-frequency table and the term vectors, computed from raw tf only and normalized over only 3 terms, are not preserved.) The computed values cos(SaS, PaP) and cos(SaS, WH) show SaS closer to PaP than to WH
62 Using the cosine similarity When searching for the documents in a collection that are most similar to a document d, we reduce this problem to that of finding the documents di with the highest inner product v(di) · v(d)
63 Term-document matrix Viewing a collection of N documents as a collection of vectors leads to a view of the collection as an MxN matrix: M rows represent the M terms; N columns represent the N documents. Each document is a vector in R^M (example matrix of log-scaled tf-idf weights not preserved)
64 Queries as vectors Given a free text query q: view this as a bag of words, treating it as a very short document; compute the query vector V(q). Answers to this query are documents of the collection that have a high cosine similarity with V(q). In other words, we score documents as follows: score(q,d) = V(q) · V(d) / (|V(q)| |V(d)|)
65 Computing vector scores Collection of documents, each represented by a vector. Free text query, represented by a vector. Positive integer K. Seek the K docs of the collection with the highest vector space scores on the given query. Basic algorithm follows: Length[N]: array holding the lengths (normalization factors) for each of the N documents; Scores[N]: scores for each document
66 Basic cosine score computation Calculate weight for query term t Where is this stored? Update score of all documents d that contain t, by adding its contribution to sim(q,d) Normalize
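The algorithm above can be sketched term-at-a-time in Python; the postings layout (term mapped to (docID, stored weight) pairs) and the toy values are assumptions for illustration:

```python
from collections import defaultdict

def top_k_cosine(query_weights, postings, lengths, K):
    """Term-at-a-time scoring: for each query term t, add w(t,q) * w(t,d)
    to doc d's score accumulator, then divide by the precomputed document
    length (the normalization factor) and return the K highest scorers."""
    scores = defaultdict(float)
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] += w_q * w_d
    ranked = sorted(scores.items(), key=lambda kv: kv[1] / lengths[kv[0]],
                    reverse=True)
    return [(d, s / lengths[d]) for d, s in ranked[:K]]

# Toy data: term -> [(docID, stored tf-idf weight)], Length[] per doc.
postings = {"ides": [(1, 3.0)], "march": [(1, 2.0), (2, 1.0)]}
lengths = {1: 2.0, 2: 1.0}
top = top_k_cosine({"ides": 1.0, "march": 1.0}, postings, lengths, K=2)
```

Storing the lengths once per document, rather than normalized floats in every postings entry, is exactly the space trade-off discussed on the next slide.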
67 Notes on score computation Why not store the normalized tf-idf value at each postings entry? idf is the same for all postings entries in a single postings list (term). What about tf/normalization? Space blowup because of floats: tf values are stored as integers; lengths are stored once per doc
68 Document & query weighting schemes Several variations of vector space scoring exist. The SMART notation represents the different variations in the form ddd.qqq: ddd encodes the term weighting of the doc vector; qqq encodes the weighting in the query vector (term frequency, document frequency, normalization)
69 Back-up slides
70 Full text queries We just scored the Boolean query bill OR rights Most users more likely to type bill rights or bill of rights How do we interpret these full text queries? No Boolean connectives Of several query terms some may be missing in a doc Only some query terms may occur in the title, etc.
71 Full text queries To use zone combinations for free text queries, we need a way of assigning a score to a pair <free text query, zone>. Zero query terms in the zone should mean a zero score; more query terms in the zone should mean a higher score. Scores don't have to be Boolean. Will look at some alternatives now
72 Incidence matrices Recall: a document (or a zone in it) is a binary vector X in {0,1}^V; the query is also such a vector Y. Score: overlap measure |X ∩ Y|, the number of query terms that appear in the document
73 Example On the query ides of march, Shakespeare s Julius Caesar has a score of 3 All other Shakespeare plays have a score of 2 (because they contain march) or 1 Thus in a rank order, Julius Caesar would come out tops
74 Overlap matching What's wrong with the overlap measure? It doesn't consider: term frequency in the document; term scarcity in the collection (document mention frequency: of is more common than ides or march); length of documents (and queries: score not normalized)
75 Overlap matching One can normalize in various ways. Jaccard coefficient: |X ∩ Y| / |X ∪ Y|. Cosine measure: |X ∩ Y| / sqrt(|X| · |Y|). What documents would score best using Jaccard against a typical query? Does the cosine measure fix this problem?
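With documents and queries as term sets (binary vectors), these two normalizations can be sketched as follows; the formulas for binary vectors are as reconstructed above:

```python
import math

def jaccard(x, y):
    """Jaccard coefficient: |X intersect Y| / |X union Y|."""
    return len(x & y) / len(x | y)

def binary_cosine(x, y):
    """Cosine for binary vectors: |X intersect Y| / sqrt(|X| * |Y|),
    which penalizes long documents less harshly than Jaccard."""
    return len(x & y) / math.sqrt(len(x) * len(y))
```

Under Jaccard, short documents that consist almost entirely of the query terms score best; the square root in the cosine denominator softens, but does not fully remove, that bias.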
76 Scoring: density-based Thus far: position and overlap of terms in a doc (title, author, etc.). Obvious next idea: if a document talks about a topic more, then it is a better match. This applies even when we only have a single query term. Document relevant if it has a lot of the terms. This leads to the idea of term weighting.
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationClassic IR Models 5/6/2012 1
Classic IR Models 5/6/2012 1 Classic IR Models Idea Each document is represented by index terms. An index term is basically a (word) whose semantics give meaning to the document. Not all index terms are
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationThe Web document collection
Web Data Management Part 1 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University
More informationFoundations of Computing
Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More information3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationPart 2: Boolean Retrieval Francesco Ricci
Part 2: Boolean Retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content p Term document matrix p Information
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationInformation Retrieval
Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval Outline ❶ Course details ❷ Information retrieval ❸ Boolean retrieval 2 Course details
More informationVector Space Models. Jesse Anderton
Vector Space Models Jesse Anderton CS6200: Information Retrieval In the first module, we introduced Vector Space Models as an alternative to Boolean Retrieval. This module discusses VSMs in a lot more
More informationSearch Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]
Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationSession 10: Information Retrieval
INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 2: Boolean retrieval 2 Blanks on slides, you may want to fill in Last Time: Ngram Language Models Unigram LM: Bag of words Ngram
More informationMore on indexing CE-324: Modern Information Retrieval Sharif University of Technology
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationIntroduction to Computational Advertising. MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski
Introduction to Computational Advertising MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski 1 Lecture 4: Sponsored Search (part 2) 2 Disclaimers This talk presents
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationNotes: Notes: Primo Ranking Customization
Primo Ranking Customization Hello, and welcome to today s lesson entitled Ranking Customization in Primo. Like most search engines, Primo aims to present results in descending order of relevance, with
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationRelevance of a Document to a Query
Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document
More informationQuerying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze
Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Boolean Retrieval Weighted Boolean Retrieval Zone Indices
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationText Retrieval an introduction
Text Retrieval an introduction Michalis Vazirgiannis Nov. 2012 Outline Document collection preprocessing Feature Selection Indexing Query processing & Ranking Text representation for Information Retrieval
More informationNear Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri
Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions
More informationBalancing Manual and Automatic Indexing for Retrieval of Paper Abstracts
Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More information