信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities
philfv8@yahoo.com
Spring 2019

Last week, we discussed a complete search system. Today: a brief review of last week, evaluation in an information retrieval system, and the second assignment.

Course schedule ( 日程安排 )
Lecture 1: Introduction, Boolean retrieval (Chapter 1)
Lecture 2: Term vocabulary and postings lists (Chapter 2)
Lecture 3: Dictionaries and tolerant retrieval (Chapter 3)
Lecture 4: Index construction (Chapter 4)
Lecture 5: Scoring, term weighting, the vector space model (Chapter 6)
Lecture 6: A complete search system (Chapter 7)
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, conclusion
Final exam

LAST WEEK

1) Initially, we have a set of documents.

2) Linguistic processing is applied to these documents (tokenization, stemming, language detection, etc.). Each document becomes a set of terms.

3) The IR system keeps a copy of each document in a cache ( 缓存 ). This is useful to generate snippets ( 片段 ).

Snippet: a short text that accompanies each document in the result list of a search engine.

4) A copy of each document is given to the indexers. These programs create different kinds of indexes: positional indexes, indexes for spelling correction, and structures for inexact retrieval.

5) When a user searches using a free-text query, the query parser transforms the query, and spelling correction is applied.

6) The indexes are then used to answer the query. Documents are scored and ranked.

7) A page of results is generated and shown to the user.

EVALUATION IN AN INFORMATION RETRIEVAL SYSTEM (Chapter 8, pdf p. 188)

Introduction. In previous chapters, we have discussed many techniques. Which techniques should be used in an IR system? Should we use stop lists? Should we use stemming? Should we use TF-IDF?

Different search engines will show different results (e.g., BAIDU vs. BING). How can we measure the effectiveness of an IR system?

Introduction. We will discuss: How can we measure the effectiveness of an IR system? Which document collections are used to evaluate an IR system? Relevant vs. non-relevant documents. Evaluation methodology for unranked retrieval results. Evaluation methodology for ranked retrieval results.

User utility. We discussed the concept of document relevance ( 文件关联 ) for a query. However, relevance is not the only measure that matters. User utility: what makes the user happy? The speed of response, the size of the index, the relevance of the results, the user interface design ( 用户界面设计 ): clarity ( 清晰 ), layout ( 布局 ), and responsiveness ( 响应能力 ) of the user interface, and the generation of high-quality snippets ( 片段 ).

8.1 How to evaluate an IR system? To evaluate an IR system, we can use: a collection of documents, a set of test queries, and a set of relevance judgments indicating which documents are relevant for each query. Together, these form the testing data ( 测试数据 ).
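
As a minimal sketch, such a test collection can be represented with a few simple data structures. The documents, queries, and relevance judgments below are hypothetical toy data, not a standard benchmark:

    # A toy test collection (hypothetical data).
    documents = {
        "d1": "Beijing is the capital of China",
        "d2": "Python is a popular programming language",
        "d3": "The python is a large snake",
    }
    queries = {
        "q1": "Beijing",
        "q2": "python programming",
    }
    # Relevance judgments (often called "qrels"): for each query,
    # the set of documents judged relevant.
    relevance_judgments = {
        "q1": {"d1"},
        "q2": {"d2"},
    }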

Traditional evaluation approach. The standard approach is to consider whether the retrieved documents are relevant or not for a query. We use the set of test queries to evaluate whether an IR system returns relevant results. The relevance judgments are also called the ground truth ( 地面的真理 ) or gold standard ( 金标准 ). It is recommended to use at least 50 queries to evaluate an IR system.

Information needs vs. queries. There is a distinction between a query and an information need. A user has an information need (wants to find some information), but the same query may correspond to different information needs. QUERY = PYTHON: an animal? or a programming language?

Information needs vs. queries. To evaluate an IR system, we will make a simplification: we will suppose that a document is either relevant (1) or irrelevant (0) for a query. But in real life, a document may be partially relevant. We will ignore this for now.

Tuning an IR system. IR systems have parameters ( 参数 ) that can be adjusted ( 调整 ), e.g. we can use different scoring functions to retrieve documents. Depending on how the parameters are adjusted, the IR system may perform better or worse on the test data. To adjust the parameters, we should use some data that is different from the testing data. Otherwise, it would be like cheating.

1) Tuning the IR system: training data ( 训练数据 ) is given to the IR system; if the results are not good, the parameters are adjusted and the process is repeated. 2) Testing the IR system: the tuned system is then evaluated on separate testing data ( 测试数据 ).
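
This tune-then-test workflow can be sketched in a few lines of code. Here evaluate() is a hypothetical stand-in for running the IR system with a given parameter value and scoring the results (for example with MAP, introduced later in this lecture); the parameter values are also made up:

    # Sketch of the tune-then-test workflow (hypothetical stand-in function and values).
    def evaluate(param, query_set):
        # A real implementation would run the queries with the given parameter
        # and compare the results against the relevance judgments.
        return -abs(param - 1.2)  # dummy score that peaks at param = 1.2

    candidate_values = [0.5, 1.0, 1.5, 2.0]  # possible settings of a scoring parameter
    # 1) Tuning: pick the value that works best on the training queries.
    best = max(candidate_values, key=lambda v: evaluate(v, "training queries"))
    # 2) Testing: report the performance of that value on separate testing queries.
    final_score = evaluate(best, "testing queries")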

8.2 Standard test collections. If we develop a new IR system, we could create our own data for training and testing the system. However, there exist some standard collections of documents that can be used for training/testing an IR system. A few examples follow.

The GOV2 collection. GOV2: a collection of 25 million webpages. Size of the data: 426 GB. http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html Provided by the University of Glasgow, UK. Useful for researchers and companies working on the development of web search engines and IR systems, but still more than 100 times smaller than the number of webpages on the Internet.

Reuters. Two document collections of news articles: Reuters-21578 (21,578 news articles) and Reuters-RCV1 (806,791 news articles). This data is especially useful for testing systems that classify news articles into categories.

8.3 Evaluation of unranked retrieval results. There exist many measures to evaluate whether the results of an IR system are good or not. Some popular measures: precision ( 精确率 ), recall ( 召回率 ), accuracy ( 准确率 ), and others.

Precision ( 精确率 ). Precision: what fraction of the returned results are relevant to the user query? Example: a person searches for webpages about Beijing, and the search engine returns 5 relevant webpages and 5 irrelevant webpages. Precision = 5 / 10 = 0.5 (50 %). Precision can be written as P(relevant | retrieved).

Contingency table ( 列联表 ). Precision and recall can also be expressed in terms of a contingency table:
Retrieved and relevant: true positive ( 真阳性 , TP)
Retrieved and non-relevant: false positive ( 假阳性 , FP)
Not retrieved and relevant: false negative ( 假阴性 , FN)
Not retrieved and non-relevant: true negative ( 真阴性 , TN)
Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Recall ( 召回率 ). Recall: what fraction of the relevant documents in the collection were returned by the system? Example: a database contains 1000 documents about HITSZ. The user searches for documents about HITSZ, and only 100 documents about HITSZ are retrieved. Recall = 100 / 1000 = 0.1 (10 %). Recall can be written as P(retrieved | relevant).

Accuracy ( 准确率 ). Accuracy: the fraction of documents that are correctly identified (as relevant or non-relevant). Example: there are 1000 documents. The IR system correctly identifies 600 documents (as relevant or irrelevant) and incorrectly identifies 400 documents. Accuracy = 600 / 1000 = 0.6 (60 %).

Limitations of accuracy. Accuracy has some problems. The distribution is skewed ( 偏态分布 ): generally, over 99.9 % of documents are irrelevant. Thus, an IR system that considers ALL documents as irrelevant has a high accuracy! But such a system would not be good for the user! In a real IR system, identifying some documents as relevant may produce many false positives.

Limitations of accuracy. A user can tolerate seeing irrelevant documents in the results, as long as there are some relevant documents. For a Web surfer, precision is the most important: every result on the first page should be relevant (high precision). It is OK if some documents are missing (low recall).

Limitations of accuracy. For a professional searcher, precision can be low, but the recall should be high (all relevant documents should be found). Precision and recall are generally inversely related: if precision increases, recall decreases; if recall increases, precision decreases.

F1-measure (F1 度量 ). A trade-off between precision and recall: F1 = 2 · P · R / (P + R). In this formula, precision and recall have the same importance. We could change the formula to put more importance on precision or on recall.
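
A minimal sketch of these set-based measures, using hypothetical sets of document identifiers:

    # Unranked evaluation for one query (hypothetical document identifiers).
    retrieved = {"d1", "d2", "d3", "d4"}   # documents returned by the system
    relevant = {"d1", "d3", "d5"}          # documents judged relevant for the query

    tp = len(retrieved & relevant)                     # true positives
    precision = tp / len(retrieved)                    # 2 / 4 = 0.50
    recall = tp / len(relevant)                        # 2 / 3 = 0.67
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)          # harmonic mean = 0.57
    print(precision, recall, f1)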

8.4 Evaluation of ranked retrieval results. The previous measures (precision, recall, F1-measure) do not consider how documents are ranked. But in search engines, documents are usually ranked. Thus, we need new measures that consider how results are ranked.

Precision-recall curve. We take the top k documents for a query (e.g. the top 10 documents) and create a graph showing how precision changes as recall changes while we move down the ranked list. If the next document is irrelevant, recall stays the same but precision drops. If the next document is relevant, recall increases and precision increases.

Precision-recall curve. The curve has a lot of jiggles. Solution: use the interpolated precision. The interpolated precision at a recall level r is the highest precision found for any recall level r' ≥ r. (Illustration: the same curve after interpolation.)
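
A sketch of how the precision-recall points and the interpolated precision can be computed from a single ranked result list (the ranking and the number of relevant documents are hypothetical); it also samples the 11 standard recall levels used by the measure described next:

    # Ranked evaluation for one query (1 = relevant, 0 = irrelevant; hypothetical data).
    ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # relevance of the top 10 results, in order
    total_relevant = 5                         # relevant documents in the whole collection

    points = []                                # (recall, precision) after each rank
    hits = 0
    for i, rel in enumerate(ranking, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / i))

    def interpolated_precision(r):
        # Highest precision at any recall level >= r (0 if that recall is never reached).
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0

    eleven_levels = [interpolated_precision(level / 10) for level in range(11)]  # 0.0, 0.1, ..., 1.0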

11-point interpolated average precision (11 点插值平均精度 ). With the previous precision-recall graph, we can evaluate the results of a single query. But what if we have more than one query? 11-point interpolated average precision: for each information need (query), we calculate the interpolated precision at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0. For each recall level, we then calculate the average precision over all queries, and we visualize this using a graph.

Precision-recall graphs (and 11-point interpolated average precision graphs) can be used to compare two IR systems.

Mean average precision. MAP: for each query, we take the average of the precision values obtained after each relevant document is retrieved; this value is then averaged over all information needs (queries):
MAP(Q) = (1/|Q|) Σ_{j=1..|Q|} (1/m_j) Σ_{k=1..m_j} Precision(R_jk)
where |Q| is the number of queries, m_j is the number of relevant documents for query j, and R_jk is the set of ranked results from the top result down to the k-th relevant document.

Mean average precision. This is an interesting measure because it produces a single number rather than a curve. Moreover, it is not necessary to do interpolation or to specify recall levels. The MAP value can vary greatly between queries, so it is necessary to use many queries for testing an IR system.
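
A sketch of the MAP computation for a small set of queries (the rankings below are hypothetical):

    # Mean average precision over a set of queries (hypothetical data).
    # Each ranking lists 1 (relevant) or 0 (irrelevant) for the returned documents;
    # num_relevant is the total number of relevant documents for that query.
    runs = [
        {"ranking": [1, 0, 1, 0, 0], "num_relevant": 2},
        {"ranking": [0, 1, 0, 0, 1], "num_relevant": 3},
    ]

    def average_precision(ranking, num_relevant):
        hits, total = 0, 0.0
        for i, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                total += hits / i      # precision at the rank of each relevant document
        return total / num_relevant if num_relevant else 0.0

    map_score = sum(average_precision(r["ranking"], r["num_relevant"]) for r in runs) / len(runs)
    print(map_score)   # (1.0 + 2/3)/2 averaged with (1/2 + 2/5)/3 gives about 0.57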

Several other measures. Precision at k: the precision computed over the top k documents in the search results. Others include R-precision, the ROC curve, sensitivity, and specificity.
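
A sketch of precision at k and R-precision for a single hypothetical ranking:

    # Precision at k and R-precision for one query (hypothetical data).
    ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # 1 = relevant, 0 = irrelevant
    R = 5                                      # number of relevant documents for the query

    def precision_at_k(ranking, k):
        return sum(ranking[:k]) / k

    p_at_10 = precision_at_k(ranking, 10)      # 4 relevant in the top 10 -> 0.4
    r_precision = precision_at_k(ranking, R)   # precision over the top R results: 3/5 = 0.6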

8.5 Assessing relevance. To assess the relevance of results, we need some testing data (documents, queries, relevance judgments). Appropriate queries for the test documents may be selected by domain experts. Providing relevance judgments for all documents is time-consuming. Solution: we can use a subset of all documents for evaluating each query.

Relevance judgments. Another problem: the relevance judgments of humans may be variable and differ from person to person. Solution: measure the agreement between different judges with the Kappa statistic: Kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of times the judges agreed and P(E) is the proportion of times the judges would be expected to agree by chance.

Calculating the Kappa statistic:
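
A sketch of the computation for two judges giving binary relevance judgments. The judgments below are hypothetical, and P(E) is estimated here by pooling the marginals of the two judges, one common way to estimate chance agreement:

    # Kappa agreement between two judges (hypothetical judgments; 1 = relevant, 0 = not).
    judge1 = [1, 1, 0, 1, 0, 0, 1, 0]
    judge2 = [1, 0, 0, 1, 0, 1, 1, 0]
    n = len(judge1)

    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n   # P(A): observed agreement
    p_relevant = (sum(judge1) + sum(judge2)) / (2 * n)          # pooled marginal P(relevant)
    p_chance = p_relevant ** 2 + (1 - p_relevant) ** 2          # P(E): agreement by chance
    kappa = (p_agree - p_chance) / (1 - p_chance)
    print(kappa)   # 0.5 for this toy data: weak agreement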

How to interpret the Kappa statistic? In general: Kappa > 0.8 indicates good agreement between judges; between 0.67 and 0.8, fair agreement; below 0.67, weak agreement, and the data should not be used for evaluation. For some real TREC data (see the book), the Kappa was generally between 0.67 and 0.8.

The concept of relevance. We have discussed various measures to evaluate IR systems. These measures are useful for tuning the parameters of an IR system to ensure that it returns relevant documents. However, the measures may not reflect what users really want, so an IR system will only be as good as the measures used to tune it. In practice, this still works quite well.

The concept of relevance. Should we tune the parameters of an IR system by hand ( 手动 )? This would be time-consuming! Search engine companies such as Baidu and Bing instead use machine learning ( 机器学习 ): machine learning automatically searches for the parameter settings that give the best performance on the evaluation measures, e.g. to choose the weights of the scoring function.

Limitations of the concept of relevance. The relevance of one document is treated as independent of the relevance of other documents. A document is either relevant or irrelevant (there is no in-between). Relevance is viewed as absolute, but it varies among people. Testing with one collection of documents or one population of users may not translate well to other documents or populations.

Some solutions. Define relevance using different degrees of relevance, e.g. 0 = irrelevant, 0.7 = high relevance, 1.0 = very high relevance. The measure of marginal relevance: how relevant is a document after the user has viewed other documents? (e.g. duplicate documents add little marginal relevance).

System issues. Besides retrieval quality, we may want to evaluate the following aspects of an IR system: How fast does it index? (documents / hour) How fast does it search? (speed / index size) How expressive is the query language? How fast can it answer complex queries?

System issues. How large is the document collection? (number of documents) Does the collection cover many topics? Most of these criteria are measurable (speed, size, etc.).

User utility. We would like to evaluate user happiness by considering the relevance, the speed, and the user interface of the system. A happy user finds what he wants and tends to use the same Web search engine again (we can measure how many users come back). There also exist surveys comparing how many users use each Web search engine.

User happiness. User happiness is hard to measure; for this reason, relevance is often measured instead. To measure user satisfaction (happiness), we need to do user studies ( 用户研究 ): we ask users to perform some tasks with the IR system, we observe them using the system, take notes and calculate measures, and we can interview the users.

8.6 User satisfaction. We may use objective measures ( 客观的措施 ): the time to complete a task, how many pages of results the user looks at. We may use subjective measures ( 主观的措施 ): a user satisfaction score, user comments on the search interface. Both qualitative ( 定性的措施 ) and quantitative ( 定量的措施 ) measures can be used.

User satisfaction. User studies are very useful (e.g. to evaluate the user interface), but they are expensive and time-consuming. It is difficult to do good user studies: we need to design the study well and interpret the results carefully.

User utility. For e-commerce: we may measure the time to purchase, or the fraction of searchers who buy something (e.g. 50 % of searchers bought some product). User happiness may not be the most important factor: the store owner's happiness (how much money is made) may matter more.

User utility. For an enterprise, a school, or a government: the most important metric is probably user productivity. How much time do users spend finding the information that they need? Information security may also be important.

Improving an IR system: A/B testing. If an IR system is used by many users, it is possible to try different versions of the system with different users. We can compare user satisfaction for these two groups of users to decide which version is better. Usually, this is done to test small modifications, and only 1 to 10 % of users are randomly selected to test the new version of the system.

Improving an IR system: example. We want to improve the scoring function of the IR system. We can ask two groups of users to use different versions of the IR system (A/B testing) and compare the number of clicks on the top search results for the two versions. This can help us choose the best scoring function.
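
A sketch of how the click data of such an A/B test could be compared. The counts are hypothetical, and the two-proportion z-test is just one simple way to check that the difference is not due to chance (the textbook does not prescribe a particular test):

    from statistics import NormalDist

    # Hypothetical A/B test counts: users who clicked a top result / users in each group.
    clicks_a, users_a = 540, 5000    # version A: current scoring function
    clicks_b, users_b = 590, 5000    # version B: new scoring function

    ctr_a, ctr_b = clicks_a / users_a, clicks_b / users_b
    # Two-proportion z-test on the click-through rates.
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = (p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b)) ** 0.5
    z = (ctr_b - ctr_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    print(ctr_a, ctr_b, round(z, 2), round(p_value, 3))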

Improving an IR system. Why is A/B testing popular? It is easy to do, we can run multiple tests to see whether several changes are good or bad, and the results are easy to understand. But it requires having enough users.

Results snippets. An IR system should show some relevant information to the user about each document found. The standard way is to show a snippet (a short summary of the document). Usually, a snippet consists of the title of the document and a summary, which is automatically extracted.

Snippets. Two types of snippets: static snippets, which are always the same (they are independent of the query), and dynamic snippets, which are different for different queries. A simple approach to create a snippet: take the first two sentences or 50 words of a document, or show some metadata such as the title, date, and author. Such a snippet is created when indexing the documents, so it is static.
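
A sketch of the simple static approach just described (first two sentences, capped at 50 words); the sentence splitting is deliberately naive:

    import re

    def static_snippet(text, max_words=50):
        # Naive static snippet: first two sentences, truncated to max_words words.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        snippet = " ".join(sentences[:2])
        words = snippet.split()
        suffix = " ..." if len(words) > max_words else ""
        return " ".join(words[:max_words]) + suffix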

Snippets: text summarization. Some researchers develop techniques to automatically summarize a text. It is a difficult research problem. How? Try to select the most important sentences of a text: the first or last sentences of paragraphs, or sentences containing key terms.

Dynamic summaries. Display some part of the document that lets the user evaluate whether the document is useful for his query. Usually, an IR system selects a part that contains many of the query terms and includes some words appearing before and after them. Users generally like dynamic summaries, but they are more complicated to provide than static summaries. Positional indexes can help, but the system still needs to keep a copy of the documents to generate the snippets.
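
A simplified sketch of a dynamic summary: choose the window of consecutive words that contains the most query terms and display it (a real system would also use positional indexes and highlight the matched terms):

    def dynamic_snippet(text, query, window=30):
        # Return the window of `window` consecutive words containing the most query terms.
        words = text.split()
        terms = {t.lower() for t in query.split()}
        best_start, best_hits = 0, -1
        for start in range(max(1, len(words) - window + 1)):
            hits = sum(w.lower().strip(".,!?") in terms for w in words[start:start + window])
            if hits > best_hits:
                best_start, best_hits = start, hits
        return " ".join(words[best_start:best_start + window])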

Snippets. Generating snippets should be fast, and snippets should not be too long. If a document changes, we should also update its snippet.

Conclusion. Today: evaluation of information retrieval systems, and the second assignment. See you next week!

References. Manning, C. D., Raghavan, P., Schütze, H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.