信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1
Last week We have discussed: A complete search system Today: Brief review of last week Evaluation in an information retrieval system Second assignment 2
Course schedule ( 日程安排 ) Lecture 1 Introduction (Chapter 1) Boolean retrieval Lecture 2 Term vocabulary and posting lists (Chapter 2) Lecture 3 Dictionaries and tolerant retrieval (Chapter 3) Lecture 4 Index construction (Chapter 4) Lecture 5 Scoring, term weighting, the vector space model (Chapter 6) Lecture 6 A complete search system (Chapter 7) Lecture 7 Evaluation in information retrieval Lecture 8 Web search engines, advanced topics, conclusion Final exam 3
LAST WEEK 4
1) Initially, we have a set of documents. 5
2) Linguistic processing is applied to these documents (tokenization, stemming, language detection, etc.). Each document becomes a set of terms. 6
3) The IR System keeps a copy of each document in a cache ( 缓存 ). This is useful to generate snippets ( 片段 ) 7
Snippet: a short text that accompanies each document in the result list of a search engine 8
4) A copy of each document is given to indexers. These programs create different kinds of indexes: positional indexes, indexes for spell correction, and structures for inexact retrieval. 9
5) When a user searches using a free-text query, the query parser transforms the query, and spell-correction is applied. 10
6) The indexes are then used to answer the query. Documents are scored and ranked 11
7) A page of results is generated and shown to the user. 12
EVALUATION IN AN INFORMATION RETRIEVAL SYSTEM Chapter 8, pdf p. 188 13
Introduction In previous chapters, we have discussed many techniques. Which techniques should be used in an IR system? Should we use stop lists? Should we use stemming? Should we use TF-IDF? 14
Different search engines will show different results BAIDU BING How can we measure the effectiveness of an IR system? 15
Introduction We will discuss: How can we measure the effectiveness of an IR system? Document collections used to evaluate an IR system Relevant vs. non-relevant documents Evaluation methodology for unranked retrieval results Evaluation methodology for ranked retrieval results 16
User utility We discussed the concept of document relevance ( 文件关联 ) for a query. Relevance is not the only thing that matters. User utility: What makes the user happy? Speed of response, Size of the index, Relevance of the results, User interface design ( 用户界面设计 ): clarity ( 清晰 ), layout ( 布局 ), responsiveness ( 响应能力 ) of the user interface. The generation of high-quality snippets ( 片段 ) 17
Snippet: a short text that accompanies each document in the result list of a search engine 19
8.1 How to evaluate an IR system? To evaluate an IR system, we can use: a collection of documents a set of test queries a set of relevance judgments indicating which documents are relevant for each query. Testing data ( 测试数据 ) Documents Queries Relevance judgments 20
Traditional evaluation approach The standard approach is to consider whether retrieved documents are relevant or not for a query. We use the set of test queries to evaluate whether an IR system returns relevant results. The relevance judgments are also called the ground truth ( 地面的真理 ) or gold standard ( 金标准 ). It is recommended to use at least 50 queries to evaluate an IR system. 21
Information needs vs queries There is a distinction between a query and an information need. A user has an information need (wants to find some information). But the same query may correspond to different information needs. QUERY = PYTHON - an animal? - or a programming language? 22
Information needs vs queries To evaluate an IR system, we will make a simplification. We will suppose that a document is either relevant (1) or irrelevant (0) for a query. But in real-life, a document may be partially relevant. We will ignore this for now. 23
Tuning an information retrieval system IR systems have parameters ( 参数 ) that can be adjusted ( 调整 ), e.g. we can use different scoring functions to retrieve documents. Depending on how the parameters are adjusted, the IR system may perform better or worse on the test data. To adjust the parameters, we should use some data that is different from the testing data. Otherwise, it would be like cheating. 24
1) Tuning an IR system Training data ( 训练数据 ) IR SYSTEM Results are good? Adjusting the parameters 2) Testing the IR system Testing data ( 测试数据 ) Results are good? 25
8.2 Standard test collections If we develop a new IR system, we could create our own data for training and testing our system. However, there exist some standard collections of documents, which can be used for training/testing an IR system. A few examples: 26
The GOV2 collection GOV2: a collection of 25 million webpages. Size of the data: 426 GB http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html Provided by the University of Glasgow, UK. Useful for researchers and companies working on the development of web search engines and IR systems. But more than 100 times smaller than the number of webpages on the Internet. 27
Reuters Two document collections of news articles. Reuters-21578: 21,578 news articles. Reuters-RCV1: 806,791 news articles. This data is especially useful for testing systems that classify news articles into categories. 28
8.3 Evaluation of unranked retrieval results There exist many measures to evaluate whether the results of an IR system are good or not. Some popular measures: Precision ( 准确率 ) Recall ( 召回 ) Accuracy.. 29
Precision ( 准确率 ) Precision: What fraction of the returned results are relevant to the user query? Example: A person searches for webpages about Beijing. The search engine returns: 5 relevant webpages 5 irrelevant webpages. Precision = 5 / 10 = 0.5 (50 %) Precision = P(relevant | retrieved) 30
Contingency table ( 列联表 ) Precision and recall can also be expressed in terms of a contingency table: True positive ( 真阳性 ) False positive ( 假阳性 ) False negative ( 假阴性 ) True negative ( 真阴性 ) Precision = TP / (TP + FP) Recall = TP / (TP + FN) 31
Recall ( 召回 ) Recall: What fraction of the relevant documents in the collection were returned by the system? Example: A database contains 1000 documents about HITSZ. The user searches for documents about HITSZ. Only 100 documents about HITSZ are retrieved. Recall = 100 / 1000 = 0.1 (10 %) Recall = P(retrieved | relevant) 32
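The two definitions above can be sketched as small functions; a minimal illustration using the counts from the Beijing and HITSZ examples:

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved: tp / (tp + fn)."""
    return tp / (tp + fn)

# Beijing example: 5 relevant and 5 irrelevant webpages returned
print(precision(tp=5, fp=5))    # 0.5
# HITSZ example: 100 retrieved out of 1000 relevant documents
print(recall(tp=100, fn=900))   # 0.1
```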
Accuracy ( 精确 ) Accuracy: the fraction of documents correctly identified (as relevant or non-relevant). Example: There are 1000 documents. The IR system correctly identifies 600 documents (as relevant or irrelevant). The IR system incorrectly identifies 400 documents. Accuracy = 600 / 1000 = 0.6 (60 %) 33
Limitations of accuracy Accuracy has some problems. The distribution is skewed ( 偏态分布 ): generally, over 99.9 % of documents are irrelevant. Thus, an IR system that considers ALL documents as irrelevant has a high accuracy! But such a system would not be good for the user! In a real IR system, identifying some documents as relevant may produce many false positives. 34
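The skew problem is easy to demonstrate numerically; a small sketch (the 1-in-10,000 ratio is an illustrative assumption, not from the slides):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all documents classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Skewed collection: 1 relevant document among 10,000.
# A system that marks EVERYTHING irrelevant misses the one relevant document,
# yet still scores almost perfectly on accuracy.
useless = accuracy(tp=0, tn=9999, fp=0, fn=1)
print(useless)  # 0.9999 -- very high accuracy, yet useless for the user
```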
Limitations of accuracy A user can tolerate seeing irrelevant documents in the results, as long as there are some relevant documents. For a Web surfer, precision is the most important: every result on the first page should be relevant (high precision). It is ok if some documents are missing (low recall). 35
Limitations of accuracy For a professional searcher: precision can be low, but the recall should be high (all relevant documents should be found). Precision and recall are generally inversely related: if precision increases, recall decreases; if recall increases, precision decreases. 36
F1-measure (F1 度量 ) A trade-off between precision and recall: F1 = 2PR / (P + R). In this formula, precision and recall have the same importance. We could change the formula to put more importance on precision or recall. 37
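A sketch of the balanced F1 and its weighted generalization F_beta (beta > 1 favours recall, beta < 1 favours precision); the example precision/recall values are illustrative:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.
    beta = 1 gives the balanced F1; beta != 1 shifts the importance."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.5))             # 0.5 (F1 equals p when p == r)
print(f_measure(0.5, 0.1))             # low: F1 punishes an unbalanced pair
print(f_measure(0.5, 0.1, beta=2.0))   # weights the (low) recall more heavily
```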
8.4 Evaluation of ranked retrieval results The previous measures (precision, recall, F1-measure) do not consider how documents are ranked. But in search engines, documents are usually ranked Thus we need new measures to consider how results are ranked. 38
Precision-recall curve We take the top k documents for a query (e.g. top 10 documents). We create a graph to see how the precision changes as the recall changes. Going down the ranked list: if the next document is irrelevant, recall stays the same but precision drops; if the next document is relevant, both recall and precision increase. 39
Precision-recall curve The curve has a lot of jiggles. Solution: use the interpolated precision. At a certain recall level r, we keep the highest precision found at any recall level ≥ r. Illustration for the same curve: 40
11-point interpolated average precision (11 点插值平均精度 ) With the previous precision-recall graph, we can evaluate the result of a single query. But what if we have more than one query? 11-point interpolated average precision: For each information need (query), we calculate the interpolated precision at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0 (as in the previous table). For each recall level, we calculate the average precision over all queries. We visualize this using a graph. 41
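The interpolation and the 11-point averaging can be sketched together; `pr_points` computes (recall, precision) pairs from a 0/1 relevance list, and the example rankings are made up for illustration:

```python
def pr_points(ranking, n_relevant):
    """(recall, precision) after each rank of a 0/1 relevance list."""
    points, tp = [], 0
    for i, rel in enumerate(ranking, start=1):
        tp += rel
        points.append((tp / n_relevant, tp / i))
    return points

def interp_precision(points, r):
    """Highest precision at any recall level >= r (removes the jiggles)."""
    vals = [p for rec, p in points if rec >= r]
    return max(vals) if vals else 0.0

def eleven_point(queries):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0
    over a list of (ranking, n_relevant) pairs, one per query."""
    levels = [i / 10 for i in range(11)]
    per_query = [
        [interp_precision(pr_points(rk, n), r) for r in levels]
        for rk, n in queries
    ]
    return [sum(col) / len(per_query) for col in zip(*per_query)]
```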
Precision-recall graphs can be used to compare two IR systems 43
11-point interpolated average precision 44
Mean average precision MAP: for each query, we average the precision values obtained after each relevant document is retrieved; this average is then itself averaged over all information needs (queries): MAP(Q) = (1/|Q|) Σ_{j=1..|Q|} (1/m_j) Σ_{k=1..m_j} Precision(R_jk), where |Q| is the number of queries, m_j is the number of relevant documents for query j, and R_jk is the set of top-ranked results up to and including the k-th relevant document. 45
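The MAP computation can be sketched directly; rankings are 0/1 relevance lists, relevant documents that are never retrieved contribute a precision of 0, and the example queries are made up:

```python
def average_precision(ranking, n_relevant):
    """Average of the precision values at each rank where a relevant
    document appears; unretrieved relevant documents contribute 0."""
    tp, precisions = 0, []
    for i, rel in enumerate(ranking, start=1):
        if rel:
            tp += 1
            precisions.append(tp / i)
    return sum(precisions) / n_relevant if n_relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranking, n_relevant) pairs, one per information need."""
    return sum(average_precision(rk, n) for rk, n in queries) / len(queries)

print(average_precision([1, 0, 1, 0], n_relevant=2))  # (1/1 + 2/3) / 2
```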
Mean average precision This is an interesting measure as it produces a single number rather than a curve. Moreover, it is not necessary to do interpolation or to specify recall levels. The MAP measure can vary greatly for different queries. Thus, it is necessary to use many queries when testing an IR system. 46
Several other measures Precision at k: the precision at the k-th document in the search results. R-precision ROC Curve Sensitivity Specificity. 47
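Two of these measures are simple enough to sketch (the ranking below is an illustrative 0/1 relevance list):

```python
def precision_at_k(ranking, k):
    """Fraction of the top-k results that are relevant."""
    return sum(ranking[:k]) / k

def r_precision(ranking, n_relevant):
    """Precision at rank R, where R is the total number of
    relevant documents for the query."""
    return precision_at_k(ranking, n_relevant)

print(precision_at_k([1, 0, 1, 0, 0], 3))  # 2/3 of the top 3 are relevant
print(r_precision([1, 0, 1, 0, 0], 2))     # 0.5
```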
8.5 Assessing relevance To assess the relevance of results, we need some testing data (documents, queries, relevance judgments). Documents Queries Relevance judgments Appropriate queries for the test documents may be selected by some domain experts. Providing relevance judgments for all documents is time consuming. Solution: We can use a subset of all documents for evaluating each query 48
Relevance judgments Another problem: relevance judgments made by humans are variable and differ from person to person. Solution: measure the agreement between different judges with the Kappa statistic: Kappa = (P(A) − P(E)) / (1 − P(E)) P(A) = proportion of times the judges agreed P(E) = proportion of times the judges would be expected to agree by chance 49
Calculating the Kappa statistic 50
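A sketch of the calculation from a 2×2 table of judgments; note this version uses per-judge marginals (Cohen's kappa), whereas the book pools the two judges' marginals, which gives a very similar value. The counts below are illustrative:

```python
def kappa_from_counts(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges labelling documents relevant / non-relevant.
    yes_no = judge 1 said relevant, judge 2 said non-relevant, etc."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n              # P(A): observed agreement
    p1 = (yes_yes + yes_no) / n                  # judge 1 says "relevant"
    p2 = (yes_yes + no_yes) / n                  # judge 2 says "relevant"
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)     # P(E): chance agreement
    return (p_agree - p_chance) / (1 - p_chance)

# 400 documents: the judges agree on 300 relevant and 70 non-relevant
print(round(kappa_from_counts(300, 20, 10, 70), 3))  # fair-to-good agreement
```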
How to interpret the Kappa stat.? In general: > 0.8: good agreement between judges. Between 0.67 and 0.8: fair agreement. < 0.67: weak agreement; the data should not be used for evaluation. For some real TREC data (see book), it was found that the Kappa was generally between 0.67 and 0.8. 51
The concept of relevance We have discussed various measures to evaluate IR systems. These measures are useful for tuning the parameters of an IR system to ensure that it returns relevant documents. However, the measures may not reflect what users really want, so an IR system will only be as good as the measures used to tune it. In practice, they are still quite good. 52
The concept of relevance Should we tune the parameters of an IR system by hand ( 手动 )? This would be time-consuming! Search engine companies such as Baidu and Bing instead use machine learning ( 机器学习 ). Machine learning automatically searches for the parameter settings that obtain the best performance on the evaluation measures, e.g. to choose weights for the scoring function. 53
Limitations of relevance Limitations of the concept of relevance: The relevance of one document is treated as independent of the relevance of other documents. A document is either relevant or irrelevant (there is no in-between). Relevance is viewed as absolute, but it varies among people. Testing with one collection of documents or one population may not translate well to other documents or populations. 54
Some solutions Define the concept of relevance using different degrees of relevance: 0 = irrelevant 0.7 = high relevance 1.0 = very high relevance The measure of marginal relevance: how relevant is a document after the user has viewed other documents? e.g. duplicate documents add little marginal relevance. 55
System issues Besides retrieval quality, we may want to evaluate the following aspects of an IR system: How fast does it index? (documents / hour) How fast does it search? (speed / index size) How expressive is the query language? How fast can the IR system answer complex queries? 56
System issues How large is the document collection? (number of documents) Does the collection cover many topics? Most of these criteria are measurable (speed, size, etc.) 57
User utility We would like to evaluate user happiness by considering: relevance, speed, user interface of the system. A happy user finds what he wants and tends to use the same Web search engine again (we can measure how many users come back). There also exist surveys comparing how many users use each Web search engine. 58
User happiness User happiness is hard to measure. For this reason, relevance is often measured instead. To measure user satisfaction (happiness), we need to do user studies ( 用户研究 ): We ask users to do some tasks with the IR system. We observe the users using the IR system, take notes, and calculate measures. We can interview the users. 59
8.6 User satisfaction We may use objective measures ( 客观的措施 ): time to complete a task, how many pages of results the user looks at. We may use subjective measures ( 主观的措施 ): a score for user satisfaction, user comments on the search interface. Both qualitative ( 定性的措施 ) and quantitative measures ( 定量的措施 ) can be used. 60
User satisfaction User studies are very useful (e.g. to evaluate the user interface) But user studies are expensive! User studies are also time-consuming. It is difficult to do good user studies Need to design the study well Need to interpret the results. 61
User utility For e-commerce: We may measure the time to purchase. We may measure the fraction of searchers who buy something, e.g. 50 % of searchers bought some product. User happiness may not be the most important: the store owner's happiness (how much money is made) may matter more. 62
User utility For an enterprise, school, or government: The most important metric is probably user productivity. How much time do users spend to find the information that they need? Information security 63
Improving an IR system A/B testing If an IR system is used by many users, it is possible to try different versions of the system with different users. We can compare user satisfaction for these two groups of users to decide which version is better. Usually, this is done to test some small modifications. Usually, only 1 to 10 % of users will be selected randomly to test the new version of the system. 64
Improving an IR system Example We want to improve the scoring function of the IR system. We can ask two groups of users to use different versions of the IR system (A/B testing). We can compare the number of clicks on the top search results for the two versions of the IR system. This can help us to choose the best scoring function. 65
Improving an IR system Why is A/B testing popular? It is easy to do A/B testing. We can do multiple tests to see if multiple changes are good or bad. Results are easy to understand. But it requires having enough users. 66
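The click comparison in an A/B test is often reduced to a standard two-proportion z-test; a minimal sketch under that assumption (the counts are illustrative, and a |z| above roughly 1.96 indicates a difference significant at the 5 % level):

```python
from math import sqrt

def ab_test_z(clicks_a, users_a, clicks_b, users_b):
    """z statistic comparing the click-through rates of versions A and B."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    return (p_a - p_b) / se

# Version B shown to a small random slice of users, as in the slides
print(ab_test_z(clicks_a=520, users_a=10_000, clicks_b=70, users_b=1_000))
```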
Results snippets An IR system should show some relevant information to the user about each document found. The standard way is to show a snippet (a short summary of the document) Usually, a snippet consists of: the title of the document, a summary, which is automatically extracted. 67
Snippets Two types of snippets: static snippets: the snippets are always the same (they are independent of the query); dynamic snippets: the snippets are different for different queries. Simple approach to create a snippet: take the first two sentences or 50 words of a document, or show some metadata such as title, date, and author. Such a snippet is created when indexing documents; it is static. 68
Snippets Text summarization Some researchers develop techniques to automatically summarize a text. It is a difficult research problem. How? Try to select the most important sentences of a text. First or last sentences of paragraphs. Sentences with key terms 69
Dynamic summaries Display some part of the document. The part shown should let the user evaluate whether the document is useful for his query. Usually, an IR system selects a part that contains many of the query terms, and includes some words appearing before and after. Users generally like dynamic summaries, but they are more complicated to provide than static summaries. We could use positional indexes, but we still need to keep a copy of the documents to generate the snippets. 70
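The selection step can be sketched as a naive sliding window that keeps the span containing the most query terms; real systems use positional indexes and smarter scoring, so this toy version is only an illustration:

```python
def dynamic_snippet(text, query_terms, window=8):
    """Return the window of `window` consecutive words that
    contains the most query terms (a naive heuristic)."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(1 for w in words[start:start + window]
                   if w.lower().strip('.,;') in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return ' '.join(words[best_start:best_start + window])

doc = ("This page describes many things. Evaluation of information "
       "retrieval systems is covered in chapter eight of the book.")
print(dynamic_snippet(doc, ["information", "retrieval"]))
```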
Snippets Generating snippets should be fast. Snippets should not be too long. If a document changes, we should also update its snippet. 71
Conclusion Today, Evaluation of information retrieval systems Second assignment. See you next week! 72
References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008 73