Retrieval System Evaluation
W. Frisch, Institute of Government, European Studies and Comparative Social Science, University of Vienna

Assignment 1
- How did you select the search engines? How did you find them?
- How did you evaluate the systems? How did you compare them?
- Did you test the systems functionally? For performance? Systematically?

Assignment 2
- Get your account ready.
- Understand what you need to do.
- Use some cut-and-paste for the answers.

Evaluate an IR System
Functional evaluation
- Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system?
- Error analysis: How often does the system fail? How easily can the user make errors?

Performance Evaluation
- Given a query, how well will the system perform? How do we define retrieval performance?
- Is finding all the related information our goal? Is it even possible to know that the system has found all the information?
- Given the user's information needs, how well will the system perform? Is the information found useful? -- Relevance

Relevance
Dictionary definition:
1. Pertinence to the matter at hand.
2. Applicability to social issues.
3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.

Relevance for IR
- A measurement of the outcome of a search: the judgment on what should or should not be retrieved.
- There are no simple answers to what is relevant and what is not. Relevance is difficult to define and subjective, depending on knowledge, needs, time, situation, etc.
- The central concept of information retrieval.

Relevance to What?
- Information needs? Problems? Requests? Queries?
- The final test of relevance is whether users find the information useful: whether they can use it to solve the problems they have and to fill the information gap they perceive.

Relevance Judgment
- The user's judgment: How well do the retrieved documents satisfy the user's information needs? How useful are the retrieved documents? If a document is related but not useful, it is still not relevant.
- The system's judgment: How well does the retrieved document match the query? How likely would the user judge this information as useful?

Factors for Relevance Judgment
- Subject: judged by subject relatedness
- Novelty: how much new information is in the retrieved document; uniqueness/timeliness
- Quality/accuracy/truth
- Availability: source or pointer? accessibility? cost?
- Language: English or non-English; readability

Relevance Measurement
- Binary: relevant or not relevant
- Likert scale: not relevant, somewhat relevant, relevant, highly relevant

Precision and Recall
Given a query, how many documents should a system retrieve?
- Are all the retrieved documents relevant?
- Have all the relevant documents been retrieved?
Measures for system performance:
- The first question is about the precision of the search.
- The second is about the completeness (recall) of the search.

The contingency table:

                  Relevant   Not relevant
  Retrieved          a            b
  Not retrieved      c            d

  Precision P = a / (a + b) = (number of relevant documents retrieved) / (total number of documents retrieved)
  Recall    R = a / (a + c) = (number of relevant documents retrieved) / (number of all relevant documents in the database)

- Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.
- Recall measures how complete a search is: the higher the recall, the fewer missing documents.

Relationship of R and P
- Theoretically, R and P do not depend on each other. Practically, high recall is achieved at the expense of precision, and high precision is achieved at the expense of recall.
- When will P = 0? Only when none of the retrieved documents is relevant.
- When will P = 1? Only when every retrieved document is relevant.
- What does P = 0.75 mean? What does R = 0.25 mean?
- What is your goal (in terms of P and R) when conducting a search? It depends on the purpose of the search, the information needs, and the system.
- What values of P and R would indicate a good system or a good search? There is no fixed value.
- Why does increasing recall often mean decreasing precision? In order not to miss anything and to cover all possible sources, one would have to scan many more materials, many of which might not be relevant.
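
As a small worked example, the sketch below computes precision and recall from two sets of document IDs; the IDs and judgments are made up for illustration.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document IDs."""
    a = len(retrieved & relevant)    # relevant and retrieved
    b = len(retrieved - relevant)    # retrieved but not relevant
    c = len(relevant - retrieved)    # relevant but not retrieved
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}   # documents the system returned
relevant = {"d1", "d3", "d5", "d6"}    # documents the user judged relevant
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")     # P = 0.50, R = 0.50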

Ideal Retrieval Systems
- An ideal IR system would have P = 1 and R = 1 for all queries. Is that possible? Why?
- If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. But then it would no longer be an information retrieval system.

Alternative Measures
Combining recall and precision:

  F = 2 / (1/R + 1/P)
  E = (1 + k^2) / (k^2/R + 1/P)

User-Oriented Measures

Measure: Coverage
- Coverage: the fraction of the documents known to the user to be relevant that has actually been retrieved.

  Coverage = (relevant documents retrieved and known to the user) / (relevant documents known to the user)

- If coverage = 1, everything the user knows has been retrieved.

Measure: Novelty
- Novelty: the fraction of the relevant documents retrieved that was unknown to the user.

  Novelty = (relevant documents retrieved but unknown to the user) / (relevant documents retrieved)

Evaluation of IR Systems Using Recall and Precision
- Conduct query searches: try many different queries; results may depend on the sampled queries.
- Compare the results in terms of precision and recall: recall and precision need to be considered together.
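
A minimal sketch of these measures, assuming relevance judgments and the user's previously known documents are available as sets of document IDs; the value of k and the example sets are illustrative.

def f_measure(p, r):
    """F = 2 / (1/R + 1/P), the harmonic mean of precision and recall."""
    return 2 / (1 / r + 1 / p) if p > 0 and r > 0 else 0.0

def e_measure(p, r, k=1.0):
    """E = (1 + k^2) / (k^2/R + 1/P); k weights recall against precision."""
    return (1 + k ** 2) / (k ** 2 / r + 1 / p)

def coverage_and_novelty(retrieved_relevant, known_relevant):
    """Coverage and novelty from sets of relevant document IDs."""
    coverage = len(retrieved_relevant & known_relevant) / len(known_relevant)
    novelty = len(retrieved_relevant - known_relevant) / len(retrieved_relevant)
    return coverage, novelty

print(round(f_measure(0.79, 0.20), 2))    # 0.32
print(round(e_measure(0.79, 0.20), 2))    # equals F when k = 1
cov, nov = coverage_and_novelty({"d1", "d3", "d5"}, {"d1", "d2", "d3"})
print(round(cov, 2), round(nov, 2))       # 0.67 0.33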

Use Precision and Recall to Evaluate IR Systems

[P-R diagram: precision (vertical axis) against recall (horizontal axis) for Systems A, B, and C]

Precision/recall pairs (P / R) per query:

  System    Query 1     Query 2     Query 3      Query 4     Query 5
  A         0.9 / 0.1   0.7 / 0.4   0.45 / 0.5   0.3 / 0.6   0.1 / 0.8
  B         0.8 / 0.2   0.5 / 0.3   0.4 / 0.5    0.3 / 0.7   0.2 / 0.8
  C         0.9 / 0.4   0.7 / 0.6   0.5 / 0.7    0.3 / 0.8   0.2 / 0.9

Use fixed interval levels of recall to compare precision (System A, precision at fixed recall levels, averaged across queries):

            Query 1   Query 2   Query 3   Average precision
  R = .25     0.6       0.7       0.9          0.73
  R = .50     0.5       0.4       0.7          0.53
  R = .75     0.2       0.3       0.4          0.30

Use fixed intervals of the number of retrieved documents to compare precision (number of relevant documents retrieved at each cutoff, and the resulting precision):

  Documents retrieved   System 1   System 2   System 3   Precision
  N = 10                    4          5          6        0.50
  N = 20                    4          5         16        0.41
  N = 30                    5          5         17        0.30
  N = 40                    8          6         24        0.31
  N = 50                   10          6         25        0.27

Problems Using P/R for Evaluation
- For a real-world system, recall is always an estimate, and results depend on the sampled queries.
- Recall and precision do not capture the interactive aspect of the retrieval process.
- Recall and precision are only one aspect of system performance: high recall and high precision are desirable, but not necessarily the most important thing the user considers.
- R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.

Quality Evaluation
- Data quality
  - Coverage of the database: a document will not be found if it is not in the database.
  - Completeness and accuracy of the data.
- Indexing methods and indexing quality: a document will not be found if it is not indexed.
  - Indexing types
  - Currency of indexing (is it updated often?)
  - Indexing sizes
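
The two comparison methods above can be sketched as follows. The interpolation rule (taking the best precision at any rank whose recall has reached the level) is one common choice, not necessarily the exact method used in the slides, and the ranking and judgments are made up.

def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

def precision_at_recall(ranking, relevant, level):
    """Best precision at any rank whose recall has reached the given level."""
    best, hits = 0.0, 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= level:
            best = max(best, hits / i)
    return best

ranking = ["d3", "d9", "d1", "d4", "d7", "d2"]   # system output, best first
relevant = {"d1", "d3", "d7"}                    # user judgments

print(precision_at_n(ranking, relevant, 5))      # 0.6
print([precision_at_recall(ranking, relevant, r) for r in (0.25, 0.50, 0.75)])
# -> [1.0, 0.666..., 0.6] for this toy ranking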

Web coverage example: about 320 million pages in total; examples of quality problems include invalid links.

Interface Considerations
- A user-friendly interface: How long does it take a user to learn the advanced features?
- How well can the user explore or interact with the query output?
- How easy is it to customize the output displays?

User Satisfaction
- The final test is the user! User satisfaction is more important than precision and recall.
- Measuring user satisfaction: surveys, usage statistics, user experiments.

User Experiments
- Observe and collect data on system behavior, user search behavior, and user-system interaction.
- Interpret experiment results: for system comparisons, for understanding users' information-seeking behavior, and for developing new retrieval systems and interfaces.

A Landmark Study
- "An evaluation of retrieval effectiveness for a full-text document-retrieval system", 1985, by David Blair and M. E. Maron.
- The first large-scale evaluation of full-text retrieval.
- Significant and controversial results; good experimental design.

The Setting
- An IBM full-text retrieval system with 40,000 documents (about 350,000 pages), to be used in the defense of a large corporate lawsuit.
- Large by 1985 standards; a typical size today.
- Mostly Boolean search functions, with some ranking functions added; full-text automatic indexing.

The Experiment
- Two lawyers generated 51 requests.
- Two paralegals conducted searches again and again until the lawyers were satisfied with the results, i.e., until the lawyers believed that more than 75% of the relevant documents had been found.
- The paralegals and lawyers could have as many discussions as needed.

The Results
- Average precision = 0.79
- Average recall = 0.20

[Figure: precision-recall plot marking the result at roughly P = 0.79, R = 0.20]

Precision Calculation
- The lawyers judged documents as vital, satisfactory, marginally relevant, or irrelevant.
- The first three categories were all counted as relevant in the precision calculation.

Recall Calculation
- Samples were drawn from subsets of the database believed to be rich in relevant documents.
- The samples were mixed with the retrieved sets and sent to the lawyers for relevance judgments.

The Most Significant Result
- The recall is low. Even though the recall was only 20%, the lawyers were satisfied (and believed that 75% of the relevant documents had been retrieved).
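
A simplified sketch of recall estimation by sampling: here the sample is drawn uniformly from the unretrieved documents, whereas the study sampled from subsets believed to be rich in relevant documents; all names and numbers are illustrative.

import random

def estimate_recall(retrieved_relevant, unretrieved_pool, judge, sample_size=200):
    """Estimate recall = retrieved relevant / (retrieved relevant + estimated missed)."""
    sample = random.sample(unretrieved_pool, min(sample_size, len(unretrieved_pool)))
    relevant_rate = sum(judge(d) for d in sample) / len(sample)
    estimated_missed = relevant_rate * len(unretrieved_pool)
    return retrieved_relevant / (retrieved_relevant + estimated_missed)

# Toy usage: 400 relevant documents retrieved, and a pool of 30,000 unretrieved
# documents in which about 5% of sampled items are judged relevant.
pool = list(range(30_000))
judge = lambda d: random.random() < 0.05     # stand-in for a human judgment
print(round(estimate_recall(400, pool, judge), 2))   # roughly 0.21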

Questions
- Why was the recall so low?
- Do we really need high recall?
- If the study were run today on search engines like Google, would the results be the same or different?

Discussion: Levels of Evaluation
- On the engineering level
- On the input level
- On the processing level
- On the output level
- On the use and user level
- On the social level
--- Tefko Saracevic, SIGIR '95