Retrieval Evaluation

Outline
- Recall and Precision
- Alternative Measures
- Reference Collections
  - TREC Collection
  - CACM and ISI Collections
  - CFC (Cystic Fibrosis Collection)
Introduction
- For data retrieval, performance evaluation focuses on time and space: the response time and the space required by the index.
- For information retrieval, besides time and space, we must evaluate how precise the answer set is, based on a test reference collection and an evaluation measure.
- Reference collection: a set of documents, example information requests, and the relevant documents for each request.
- Evaluation measure: quantifies the similarity between the set of documents retrieved by a strategy S and the set of relevant documents provided by the specialists.

Recall and Precision
- I: an information request
- R: the set of documents relevant to I
- A: the document answer set retrieved for I
- Ra: the documents in both R and A (the relevant retrieved documents)

[Figure: the collection, with the relevant documents R, the answer set A, and their intersection Ra]
- Recall: the fraction of the relevant documents (the set R) which has been retrieved:

    Recall = |Ra| / |R|

- Precision: the fraction of the retrieved documents (the set A) which is relevant:

    Precision = |Ra| / |A|

Recall and Precision: an example
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rq: the set containing the relevant documents for query q
- Ranking generated for query q:
  d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3

[Figure: precision at the 11 standard recall levels (0%, 10%, ..., 100%) for this example]
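The example above can be traced in code. A minimal sketch (not from the slides; the function name is ours) that records a (recall, precision) pair each time a new relevant document appears in the ranking:

```python
# Relevant set Rq and ranking for query q, from the example in the slides
# (document identifiers kept as plain integers).
Rq = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
ranking = [123, 84, 56, 6, 8, 9, 511, 129, 187, 25, 38, 48, 250, 113, 3]

def recall_precision_points(ranking, relevant):
    """Return (recall, precision) observed at each relevant document."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return points

for r, p in recall_precision_points(ranking, Rq):
    print(f"recall={r:.0%}  precision={p:.1%}")
```

At the tenth document (d25), for instance, four of the ten relevant documents have been seen, so recall is 40% and precision is 40%.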
Recall and Precision (continued)
- The recall/precision figures are usually averaged over a set of example queries to form a single precision-versus-recall curve.
- Such curves evaluate quantitatively both the quality of the overall answer set and the breadth of the retrieval algorithm.
- Advantages: simple, intuitive, and the two measures can be combined in a single curve.

Single-value summaries
- Used to compare retrieval performance for individual queries.
- Average precision at seen relevant documents: the average of the precision figures obtained after each new relevant document is observed (in the ranking); favors systems which retrieve relevant documents quickly.
- R-precision: the precision at the R-th ranking position, where R is the number of relevant documents.
- Precision histograms
- Summary table statistics
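Two of these single-value summaries can be sketched directly. An illustration on our part (function names are ours), under the common convention that relevant documents never retrieved contribute a precision of zero:

```python
def average_precision(ranking, relevant):
    """Average of the precision values observed at each seen relevant
    document; unretrieved relevant documents contribute zero (assumed)."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def r_precision(ranking, relevant):
    """Precision at ranking position R, where R = |relevant|."""
    R = len(relevant)
    return sum(1 for doc in ranking[:R] if doc in relevant) / R

# The example query from the slides:
Rq = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
ranking = [123, 84, 56, 6, 8, 9, 511, 129, 187, 25, 38, 48, 250, 113, 3]
print(average_precision(ranking, Rq))  # ~0.29
print(r_precision(ranking, Rq))        # 0.4
```

A system that places its relevant documents earlier in the ranking gets a higher average precision, which is exactly the "retrieves relevant documents quickly" behavior the measure is meant to reward.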
Recall and Precision: appropriateness
- Proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection.
- R and P are related measures which capture different aspects of the set of retrieved documents.
- R and P measure the effectiveness over a set of queries processed in batch mode.
- R and P are easy to define only when a linear ordering of the retrieved documents is enforced.

Alternative Measures
- User-oriented measures
  - Coverage
  - Novelty
  - Relative recall
  - Recall effort
- Other measures
  - Expected search length
  - Satisfaction
  - Frustration
The harmonic mean
- F(j): the harmonic mean of the recall r(j) and the precision P(j) for the j-th ranked document:

    F(j) = 2 / (1/r(j) + 1/P(j))

  - r(j): the recall for the j-th ranked document
  - P(j): the precision for the j-th ranked document
- F(j) assumes a high value only when both recall and precision are high.

The E measure
- Allows the user to specify whether he is more interested in recall or in precision:

    E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j))

- b: a user-specified parameter which reflects the relative importance of recall and precision (b > 1: more precision, b < 1: more recall).

User-oriented measures
- R: the set of relevant documents; A: the answer set
- U: the relevant documents known to the user
- Rk: the relevant documents known to the user which were retrieved
- Ru: the relevant documents previously unknown to the user which were retrieved

[Figure: coverage and novelty ratios for a given example information request]
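The two formulas above translate directly into code. A small sketch (function names are ours):

```python
def f_measure(r, p):
    """Harmonic mean of recall r and precision p at a given rank."""
    if r == 0 or p == 0:
        return 0.0
    return 2 / (1 / r + 1 / p)

def e_measure(r, p, b):
    """E = 1 - (1 + b^2) / (b^2/r + 1/p), with b weighting the relative
    importance of recall and precision as described in the slides."""
    return 1 - (1 + b * b) / (b * b / r + 1 / p)
```

With b = 1 the E measure reduces to 1 - F(j), so minimizing E is then equivalent to maximizing the harmonic mean.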
User-oriented measures (continued)

    coverage = |Rk| / |U|
    novelty = |Ru| / (|Ru| + |Rk|)

- High coverage ratio: the system is finding most of the relevant documents the user expected to see.
- High novelty ratio: the system is revealing (to the user) many new relevant documents which were previously unknown.

Reference Collections
- Criticisms of IR evaluation:
  - it lacks a solid formal framework as a basic foundation
  - it lacks robust and consistent testbeds and benchmarks
  - experiments were based on relatively small test collections
  - comparisons between various retrieval systems were difficult
- Collections covered here:
  - TREC (Text REtrieval Conference) collection: large size, thorough experimentation
  - CACM and ISI collections: historical importance in IR
  - Cystic Fibrosis collection
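The two ratios can be sketched with set operations. The set names follow the slide (U, Rk, Ru); the function names and the toy data are ours:

```python
def coverage(answer_set, known_relevant):
    """|Rk| / |U|: fraction of the relevant documents the user already
    knew (U) that appear in the answer set (Rk)."""
    rk = answer_set & known_relevant
    return len(rk) / len(known_relevant)

def novelty(answer_set, relevant, known_relevant):
    """|Ru| / (|Ru| + |Rk|): fraction of the retrieved relevant documents
    that were previously unknown to the user (Ru)."""
    retrieved_relevant = answer_set & relevant
    ru = retrieved_relevant - known_relevant
    rk = retrieved_relevant & known_relevant
    return len(ru) / (len(ru) + len(rk))

# Toy example: the user knows two relevant documents; the system retrieves
# one of them plus one relevant document that is new to the user.
U = {1, 2}
relevant = {1, 2, 3, 4}
A = {1, 3, 5}
print(coverage(A, U), novelty(A, relevant, U))  # 0.5 0.5
```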
The TREC collection: the documents
- Sources: WSJ, AP, ZIFF, FR, DOE, etc.
- Documents are tagged with SGML for easy parsing.
- Goals: preserve as much of the original structure as possible, while providing a common framework for simple decoding.

Example document:

    <doc>
    <docno> WSJ880406-0090 </docno>
    <hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
    <author> Janet Guyon (WSJ Staff) </author>
    <dateline> New York </dateline>
    <text>
    American Telephone & Telegraph Co. introduced the first of a new
    generation of phone services with broad...
    </text>
    </doc>

The TREC collection: example information requests (topics)
- A topic is a description of an information need in natural language, used for testing a new ranking algorithm.

Example topic:

    <top>
    <num> Number: 168
    <title> Topic: Financing AMTRAK
    <desc> Description:
    A document will address the role of the Federal Government in financing
    the operation of the National Railroad Transportation Corporation (AMTRAK)
    <narr> Narrative:
    A relevant document must provide information on the government's
    responsibility to make AMTRAK an economically viable entity. It could
    also discuss the privatization of AMTRAK as an alternative to continuing
    government subsidies. Documents comparing government subsidies given to
    air and bus transportation with those provided to AMTRAK would also be
    relevant.
    </top>
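Because the tags follow a simple, regular pattern, a minimal field extractor can be sketched. This is an illustration on our part, not a TREC-supplied tool; real collections need a more careful SGML parser:

```python
import re

def extract_field(doc_text, tag):
    """Return the text between <tag> and </tag>, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", doc_text, re.DOTALL)
    return m.group(1).strip() if m else None

doc = "<doc><docno> WSJ880406-0090 </docno><text> American Telephone... </text></doc>"
print(extract_field(doc, "docno"))  # WSJ880406-0090
```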
The TREC collection: the relevant documents for each topic
- The set of relevant documents for each topic is obtained from a pool of possible relevant documents.
- The pool is created by taking the top K documents in the rankings generated by the various participating retrieval systems.
- Pooling method: a technique for assessing relevance, based on two assumptions:
  - the vast majority of the relevant documents is collected in the assembled pool
  - the documents which are not in the pool can be considered to be not relevant

The TREC collection: the tasks
- Ad hoc task: a set of new (conventional) requests are run against a fixed document database.
- Routing task: a set of fixed requests are run against a database whose documents are continually changing (similar to filtering):
  - one document set for training and allowing the tuning of the retrieval algorithm
  - another document set for testing the tuned retrieval algorithm
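The pooling step itself can be sketched in a few lines. A simplification on our part (the function name is ours, and real TREC pooling also tracks per-system run metadata):

```python
def build_pool(rankings, k):
    """Union of the top-k documents from each system's ranking. Only the
    pooled documents are judged by assessors; documents outside the pool
    are assumed not relevant."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool

# Two hypothetical system runs for the same topic:
run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d3", "d5", "d1", "d6"]
print(sorted(build_pool([run_a, run_b], 2)))  # ['d1', 'd2', 'd3', 'd5']
```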
The CACM and ISI collections
- CACM collection:
  - 3204 Communications of the ACM articles
  - subfields: author, date, title, abstract, categories, references, bibliographic coupling, number of co-citations
- ISI collection:
  - 1460 documents selected from ISI; similarities based on terms and on cross-citation patterns
  - subfields: author, word stems from the title and abstract sections, number of co-citations for each pair of articles

[Table: document statistics for the CACM and ISI collections]
[Table: query statistics for the CACM and ISI collections]
The Cystic Fibrosis (CF) collection
- 1239 documents
- Fields contained in each document:
  - MEDLINE accession number
  - author
  - title
  - source
  - major subjects
  - minor subjects
  - abstract (or extract)
  - references
  - citations

The CF collection: characteristics
- The set of relevance scores was generated directly by human experts through a careful evaluation strategy.
- The collection includes a good number of information requests; as a result, the respective query vectors present overlap among themselves.
- This overlap allows experimenting with retrieval strategies which take advantage of past query sessions to improve retrieval performance.