Retrieval Effectiveness Measures
Vasu Sathu
25th March 2001

Overview
- Evaluation in IR
- Types of Evaluation
- Retrieval Performance Evaluation
- Measures of Retrieval Effectiveness
- Single Valued Measures
- Alternative Measures
- TREC Collection
Why Evaluate an IR System?
- Evaluation of an information retrieval system is done before its final implementation.
- To find out whether users really need such a system and whether it will be worth it.
- To select between alternative systems.
- To determine whether a system meets the expressed and unexpressed needs of current users and non-users.
- To improve IR systems and determine whether an improvement actually occurred.

What to Evaluate
- Coverage of the collection: the extent to which the system includes relevant matter.
- Time lag: the interval between the time the search request is made and the time an answer is given.
- Form of presentation of the output.
- Effort required of the user in obtaining answers to a search request.
- Recall of the system: the proportion of relevant material actually retrieved in answer to a search request.
- Precision: the proportion of retrieved material that is actually relevant.
Types of Evaluation
- An IR system is often a component of a larger system. Several aspects might be evaluated:
  - speed of retrieval,
  - resources required,
  - presentation of documents,
  - ability to find relevant documents.
- Evaluation is generally comparative.
- The most common evaluation is retrieval effectiveness.

Retrieval Performance Evaluation
- The first step in the evaluation process is functional analysis, in which the specified system functionalities are tested one by one. It should also include an error analysis phase, during which it is useful to catch programming errors.
- After the system has passed the functional analysis phase, the performance of the system should be evaluated.
- In a system designed for data retrieval, the response time and the space required are usually the metrics of most interest when evaluating the system.
Retrieval Performance Evaluation (Contd.)
- Performance evaluation using indexing structures depends on several factors, such as interaction with the operating system, delays in communication channels, and overheads introduced by many software layers.
- Relevance ranking plays a central role in IR. IR systems require an evaluation of how precise the answer set is; this type of evaluation is known as retrieval performance evaluation.
- Retrieval performance evaluation depends on two major factors:
  - test reference collections,
  - evaluation measures.

Retrieval Performance Evaluation (Contd.)
- A test reference collection consists of:
  - a collection of documents,
  - a set of example information requests,
  - a set of relevant documents for each example information request.
- Evaluation is based on the ability of the retrieval system to distinguish between wanted and unwanted items.
- The retrieval task can be run in either of two modes:
  - Batch: the user submits a query and receives an answer back.
  - Interactive session: the user specifies his information need through a series of interactive steps with the system.
Measures of Retrieval Effectiveness
- Effectiveness is purely a measure of the ability of the system to satisfy the user in terms of the relevance of the documents retrieved.
- A relevant document is one closely related to the context of the query the user is interested in.
- Recall and precision attempt to measure what is known as the effectiveness of the retrieval system: its ability to retrieve relevant documents while holding back the non-relevant ones.

Precision and Recall
- Precision: the proportion of the retrieved set that is relevant.
  Precision = relevant retrieved / retrieved = P(relevant | retrieved)
- Recall: the proportion of all relevant documents in the collection that are included in the retrieved set.
  Recall = relevant retrieved / relevant = P(retrieved | relevant)
- Recall and precision tend to trade off against each other: raising one typically lowers the other.
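As a minimal sketch, the two definitions above can be computed directly from a retrieved set and a relevant set; the document IDs below are invented for illustration:

```python
# Precision and recall from a retrieved set and a relevant set.
# Document IDs are hypothetical examples, not from any real collection.

def precision(retrieved, relevant):
    """Proportion of the retrieved set that is relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Proportion of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}   # answer set returned by the system
relevant  = {"d1", "d3", "d5"}         # documents judged relevant for the query

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved -> 0.666...
```

Note how the same intersection (relevant retrieved) is divided by a different denominator in each measure, which is why the two can move in opposite directions.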
Precision and Recall (Contd.)
Contingency table:

                 Relevant   Non-Relevant
  Retrieved         w            x
  Not Retrieved     y            z

- Relevant = w + y
- Retrieved = w + x
- Total N = w + x + y + z
- Precision = w / (w + x)
- Recall = w / (w + y)
- Fallout = x / (x + z)

Recall Graph
- Recall as more and more documents are retrieved: the graph has a terraced (step) shape.
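The contingency-table measures can be sketched as follows; the cell counts are made up purely to exercise the formulas:

```python
# Precision, recall, fallout, and generality from the contingency-table
# cell counts w, x, y, z (counts below are invented for illustration).

def contingency_measures(w, x, y, z):
    relevant  = w + y           # all relevant documents in the collection
    retrieved = w + x           # all documents the system returned
    n         = w + x + y + z   # total collection size
    return {
        "precision":  w / retrieved,
        "recall":     w / relevant,
        "fallout":    x / (x + z),      # = x / (N - Relevant)
        "generality": relevant / n,
    }

m = contingency_measures(w=30, x=20, y=10, z=940)
print(m["precision"])   # 30 / 50  = 0.6
print(m["recall"])      # 30 / 40  = 0.75
print(m["fallout"])     # 20 / 960
print(m["generality"])  # 40 / 1000 = 0.04
```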
Precision Graph
- Precision as more and more documents are retrieved: the graph has a sawtooth shape.

Precision and Recall (Contd.)
- Fallout: the proportion of non-relevant documents that are retrieved.
  F = x / (N - Relevant)
  It indicates how well the system filters out non-relevant documents.
- Generality: the proportion of relevant documents in the collection.
  G = Relevant / N
- Criteria commonly used to evaluate performance:
  - recall,
  - precision,
  - user effort, i.e. the amount of time the user spends conducting the search, negotiating his enquiry, and separating relevant from irrelevant items.
Alternative Measures
- A single measure that combines precision and recall is the harmonic mean F:
  F(j) = 2 / (1/r(j) + 1/P(j))
  where
  - r(j) is the recall for the j-th document in the ranking,
  - P(j) is the precision for the j-th document in the ranking,
  - F(j) is the harmonic mean of r(j) and P(j).
- The value of F lies between 0 and 1:
  - 0 when no relevant documents have been retrieved,
  - 1 when all ranked documents are relevant.
- F assumes a high value only when both recall and precision are high.

Alternative Measures (Contd.)
- E measure (van Rijsbergen): used to weight precision (or recall). The E measure is defined as:
  E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j))
  where
  - r(j) is the recall for the j-th document in the ranking,
  - P(j) is the precision for the j-th document in the ranking,
  - E(j) is the evaluation measure relative to r(j) and P(j),
  - b is a user-specified parameter reflecting the relative importance of recall and precision.
- For b = 1, E(j) works as the complement of the harmonic mean F(j).
- b > 1: the user is more interested in precision than in recall.
- b < 1: the user is more interested in recall than in precision.
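Both combined measures can be sketched as small functions; the sample recall and precision values are invented:

```python
# Harmonic-mean F and van Rijsbergen's E measure at a ranking position j.
# r and p are the recall and precision at that position (sample values).

def f_measure(r, p):
    """F(j) = 2 / (1/r + 1/p); defined as 0 if either component is 0."""
    return 2 / (1 / r + 1 / p) if r > 0 and p > 0 else 0.0

def e_measure(r, p, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / r + 1 / p)."""
    return 1 - (1 + b * b) / (b * b / r + 1 / p)

r, p = 0.6, 0.75
print(f_measure(r, p))        # harmonic mean of 0.6 and 0.75 -> 0.666...
print(e_measure(r, p, b=1))   # for b = 1, equals 1 - F -> 0.333...
```

The identity E = 1 - F at b = 1 is a quick sanity check that both formulas were entered consistently.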
Single Valued Measures
- Normalized precision or recall: measures the area between the actual and ideal curves.
- The point at which precision = recall is called the breakeven point.
- Swets model.
- Expected search length.
- Utility measures:
  - assign a cost or value to each cell in the contingency table,
  - sum (or average) the costs over all queries.

Swets Model
- Proposed by Swets in 1963.
- Based on signal detection and statistical decision theory.
- Properties of a desirable measure of retrieval performance:
  - It should be based on the ability of the retrieval system to distinguish between wanted and unwanted items.
  - It should express discrimination power independent of any acceptance criterion employed by the system or the user.
  - The measure should be a single number.
  - It should allow a complete ordering of different performances, indicate the amount of difference, and assess performance in absolute terms.
Swets Model (Contd.)
- Characterizes the recall-fallout curves generated by the variation of a control variable.
- Uses the distance between operating characteristics as its measure.
- Brookes' equation:
  S2 = (u2 - u1) / sqrt((σ1² + σ2²) / 2)
- The area under the recall-fallout graph is a strictly increasing function of S2.

Expected Search Length
- Users are able to quantify their information need according to one of the following types:
  - only one relevant document is needed,
  - some arbitrary number n is wanted,
  - all relevant documents are wanted,
  - a given proportion of the relevant documents is wanted.
- The output of a search strategy is assumed to be a weak ordering of documents. A simple ordering means no two or more documents are at the same level of the ordering.
- Search length is the number of non-relevant documents a user must scan before the information need is satisfied.
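Brookes' separation measure from the Swets model slide can be sketched as follows, assuming u1, σ1 and u2, σ2 are the mean and standard deviation of the scores of non-relevant and relevant documents respectively; the numbers are hypothetical:

```python
import math

# Brookes' S2 for the Swets model: the difference between the mean scores of
# relevant (u2) and non-relevant (u1) documents, scaled by a pooled standard
# deviation. All input values below are invented for illustration.

def brookes_s2(u1, sigma1, u2, sigma2):
    return (u2 - u1) / math.sqrt((sigma1 ** 2 + sigma2 ** 2) / 2)

# Relevant documents score higher on average than non-relevant ones:
print(brookes_s2(u1=0.30, sigma1=0.10, u2=0.55, sigma2=0.10))  # 2.5
```

A larger S2 means the two score distributions are further apart, i.e. the system discriminates better between wanted and unwanted items.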
Expected Search Length (Contd.)
- For a query q of a given type:
  ESL = j + i·s / (r + 1)
  where
  - j is the total number of non-relevant documents in all levels preceding the final level,
  - r is the number of relevant documents in the final level,
  - i is the number of non-relevant documents in the final level,
  - s is the number of relevant documents required from the final level to satisfy the need.
- Use the mean expected search length over a set of queries.
- If queries or collections vary, compare the ESL to the expected random search length.

TREC Collection
- The Text REtrieval Conference (TREC) is dedicated to experimenting with a large test collection comprising over a million documents.
- The TREC series is organized by the National Institute of Standards and Technology (NIST).
- Its goal is to encourage research in IR on large text applications by providing a large text collection.
- The TREC collection is composed of three parts:
  - the documents,
  - the example information requests (called topics),
  - a set of relevant documents for each example information request.
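The expected search length formula from the slide above reduces to one line of arithmetic; the level counts below are invented:

```python
# Expected search length: ESL = j + i*s / (r + 1), where
#   j = non-relevant documents in all levels before the final one,
#   i = non-relevant documents in the final level,
#   r = relevant documents in the final level,
#   s = relevant documents still required from the final level.
# All counts below are made up for illustration.

def expected_search_length(j, i, r, s):
    return j + (i * s) / (r + 1)

# Example: the user scans past j = 5 non-relevant documents in earlier
# levels, then needs s = 2 more relevant documents from a final level
# holding r = 3 relevant and i = 4 non-relevant documents.
print(expected_search_length(j=5, i=4, r=3, s=2))  # 5 + 8/4 = 7.0
```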
TREC Collection (Contd.)
- The TREC collection contains documents from sources such as the Wall Street Journal (WSJ), the Associated Press (AP), the Federal Register (FR), and US Patents (PAT).
- The task of converting an information request (topic) into a system query must be done by the system itself and is considered an integral part of the evaluation procedure.
- The set of relevant documents for each example information request is obtained from a pool of possibly relevant documents. The pooling method is used to evaluate the relevance of each document.

Conclusion
- Retrieval can be made more effective by applying better techniques and making the search itself more effective.
- Users should be able to search in ways that are already familiar or that they have found to be effective.
- A visual representation of the contents of a system may aid users in orienting themselves.
References
- [Modern IR] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. "Modern Information Retrieval." Addison-Wesley (ACM Press), January 1999.
- [IR] C.J. van Rijsbergen. "Information Retrieval, Second Edition." 1979, 192 pages.
- [Info Storage & Retrieval] Robert R. Korfhage. "Information Storage and Retrieval." Wiley, May 1997.
- David C. Blair and M.E. Maron. "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System." Communications of the ACM, Vol. 28, No. 3.