信息检索与搜索引擎 Introduction to Information Retrieval GESC1007


1 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities Spring

2 Last week we discussed evaluation in an information retrieval system. Today: Web search engines, the solution of the second assignment, and information about the final exam.


3 Course schedule ( 日程安排 )
Lecture 1: Introduction (Chapter 1), Boolean retrieval
Lecture 2: Term vocabulary and posting lists (Chapter 2)
Lecture 3: Dictionaries and tolerant retrieval (Chapter 3)
Lecture 4: Index construction (Chapter 4)
Lecture 5: Scoring, term weighting, the vector space model (Chapter 6)
Lecture 6: A complete search system (Chapter 7)
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, conclusion
Final exam

4 WEB SEARCH ENGINES

5 19.1 The Web: What is special about the Web? The very large number of documents, the lack of coordination in the creation of the documents, and the diversity of backgrounds and motives of the participants.

6 The Web: The Web is a set of webpages ( 网页 ). Webpages are written in a language called HTML.


8 The Web Browser: Webpages are stored on servers ( 服务器 ). To access a webpage, one must use software called a Web browser ( 浏览器 ). Webpages are sent over the Internet using the HTTP protocol (HTTP 协议 ). (Figure: a browser reaching the home page on the HITSZ server over the Internet.)

9 The Web: The idea of the Web is that each webpage contains links to other webpages (hyperlinks - 超链接 ). Each webpage has an address (URL). Creating a webpage is not difficult. Webpages have become one of the best ways to supply and consume information.

10 The Web: Billions of webpages contain information, but if we cannot search this information, it is useless. Historically, there have been two ways of searching for information: search engines (Baidu, Bing, etc.) and directories (Yahoo!, etc.).

11 Web directories ( 网络目录 ): A Web directory is a list of websites, organized by category.

12 Web directories ( 网络目录 ): A Web directory contains only the best webpages for each category. Problems: Web directories are created by humans, which takes a lot of time. They are not convenient for searching: a user needs to know how to find information within the categories, and there can be thousands of categories. The information in the categories is often old. For these reasons, Web directories have mostly disappeared.

13 Web search engines: Baidu, Bing, etc. They adapt information retrieval techniques to search billions of documents. The techniques are adapted in terms of indexing, query processing, and document ranking.

14 Web search engines: Why are they popular? They can answer queries quickly, they can index millions of documents, and they are almost always up to date. Fifteen years ago, the results returned by Web search engines were not very good. Novel ranking techniques ( 排序技术 ) and spam-fighting techniques ( 反垃圾邮件技术 ) have since been proposed to obtain better results.

15 19.2 Web characteristics: The Web is mainly decentralized ( 分散 ). It contains many languages and many different types of content. Some webpages contain only pictures and no text. The Web also contains a lot of unreliable information. How can a search engine know which websites can be trusted?

16 Size of the Web: 1995: 30 million webpages indexed by AltaVista. 2017: 4.48 billion webpages. Note: only static webpages are counted. A dynamic webpage is a page whose content is generated in real time for the user.

17 The Web graph: The Web can be viewed as a graph ( 图 ). Each webpage is a vertex ( 顶点 ). A link between two webpages is an edge ( 图的边 ). Because links have a direction, the Web is a directed graph ( 有向图 ). (Figure: a graph of webpages A, B, C, ..., F.)

18 The Web graph: There are two types of links. In-links are links that go to a page. Out-links are links that leave a page. In the example graph, node B has 3 in-links and 1 out-link.
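The in-link and out-link counts can be computed directly from an adjacency list. The sketch below is only an illustration: the graph is a hypothetical one chosen so that page B has 3 in-links and 1 out-link, and it is not a reproduction of the figure on the slide.

```python
# Hypothetical toy Web graph: each page maps to the pages it links to.
out_links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["B"],
    "D": [],
    "E": ["B"],
    "F": ["A"],
}

def count_in_links(graph):
    """Return a dict mapping each page to its number of in-links."""
    in_counts = {page: 0 for page in graph}
    for source, targets in graph.items():
        for target in targets:
            in_counts[target] = in_counts.get(target, 0) + 1
    return in_counts

in_link_counts = count_in_links(out_links)
out_link_counts = {page: len(targets) for page, targets in out_links.items()}

print(in_link_counts["B"], out_link_counts["B"])  # prints: 3 1
```

Storing only out-links mirrors how a crawler sees the Web: it reads a page and finds the links that leave it. In-links have to be derived by scanning every page.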

19 The Web graph: Not all webpages are equally popular. Many webpages have few in-links, and few webpages have many in-links. The number of in-links per webpage follows a power law distribution ( 幂律分布 ). (Figure: number of webpages plotted against number of in-links.)

20 Spam: For some queries, there is huge competition to appear high in the results of search engines, e.g. Beijing real-estate ( 房地产 ). Thus, many people modify their websites to try to appear first in the search results, e.g. by writing «Beijing real-estate» many times in a webpage to increase the term frequency, or by writing invisible text in the background color of the webpage (e.g. white).
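To see why repeating a phrase works against a naive ranking function, consider a minimal sketch that counts phrase occurrences. The two page texts and the `term_frequency` helper are hypothetical, and real search engines use far more than a raw count.

```python
def term_frequency(text, phrase):
    """Count non-overlapping, case-insensitive occurrences of a phrase."""
    return text.lower().count(phrase.lower())

# Hypothetical pages: the second one repeats the target phrase 20 extra times.
honest_page = "We sell apartments. Ask us about Beijing real-estate prices."
stuffed_page = honest_page + " Beijing real-estate" * 20

honest_tf = term_frequency(honest_page, "Beijing real-estate")
stuffed_tf = term_frequency(stuffed_page, "Beijing real-estate")
print(honest_tf, stuffed_tf)  # prints: 1 21
```

A ranking that relies only on term frequency would place the stuffed page far above the honest one, which is why repeated keywords are one of the signals used for spam detection.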

21 Spam detection: Nowadays, search engines use many sophisticated methods to detect spam (repeated keywords, etc.). Websites that try to cheat may be blocked from search engines. Thus, some people have developed new techniques to cheat search engines ( 欺骗搜索引擎 ).

22 Cloaking ( 伪装 ): One such technique is cloaking: a website tries to cheat by showing different content to search engines and to users. This is a problem that did not exist in traditional IR.

23 Paid inclusion: A website can also pay a search engine to appear high in the results. Some search engines do not allow paid inclusion.


25 Doorway page: A doorway page is a page containing carefully chosen text so that it ranks highly in search engines for some keywords. The page then links to another page containing commercial content. A website may have many doorway pages. (Figure: two doorway pages linking to another webpage.)

26 Link analysis: To reduce the problem of spam on the Web, many search engines perform link analysis. Basic idea: rank a page higher, or treat it as more reliable, if it has many in-links, e.g. the PageRank algorithm.
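As a rough illustration of link-based ranking, here is a minimal PageRank-style sketch using power iteration. The toy graph, the damping factor of 0.85, and the fixed number of iterations are assumptions made for this example; the slides do not specify them, and production systems differ in many details.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [linked pages]}.

    Every linked page must also appear as a key of the graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page shares its score equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy Web: B receives in-links from A, C and D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
scores = pagerank(toy_web)
best_page = max(scores, key=scores.get)
print(best_page)  # prints: B
```

Unlike a plain in-link count, the score of a page here also depends on the scores of the pages that link to it, so C ranks well because it is linked from the highly ranked page B.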

27 Link analysis: But some people create fake links to increase the popularity of their webpages. There is thus a continuing battle between spammers and search engines.

28 19.3 Advertising ( 广告 ): There are two main advertisement models. 1) Cost per view: the goal is to show some content to the user (branding). An image is typically used. A company may pay to display the image 1000 times.

29 Advertising ( 广告 ): There are two main advertisement models. 2) Cost per click: the goal is that people click on an advertisement to visit the website of the advertiser (initiate a transaction). The website may then ask the person to buy something. An image or text may be used, with a link. A company may pay for 1000 clicks.
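The difference between the two models comes down to what is being billed. The following sketch compares them with made-up numbers; all prices, view counts, and click counts are hypothetical and serve only to show the arithmetic.

```python
def cost_per_view_total(views, price_per_thousand_views):
    """Total cost when the advertiser pays per 1000 displays of the ad."""
    return views / 1000 * price_per_thousand_views

def cost_per_click_total(clicks, price_per_click):
    """Total cost when the advertiser pays only for clicks on the ad."""
    return clicks * price_per_click

# Hypothetical campaign: the ad is displayed 100,000 times and clicked 1,000 times.
views = 100_000
clicks = 1_000

view_model_cost = cost_per_view_total(views, price_per_thousand_views=2.0)
click_model_cost = cost_per_click_total(clicks, price_per_click=0.5)
print(view_model_cost, click_model_cost)  # prints: 200.0 500.0
```

Under cost per view the advertiser pays even if nobody clicks, which suits branding. Under cost per click the bill depends on how many users actually visit the advertiser's website.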

30 19.3 Advertising ( 广告 ): Today, many search engines earn money from advertising. Some display search results and advertisements (sponsored search results) separately. Other search engines combine search results and advertisements.

31 (Figure: search results displayed next to sponsored search results.)

32 Online advertisement networks: There are many advertisement networks. Bing Ads provides pay-per-click advertisements for Bing and Yahoo. AdWords sells advertisements on various websites.

33 Click spam: A company clicks on the advertisements of its competitors to make them spend their advertising money. This may be done using automatic software. A search engine must use techniques to block click spam.

34 Example: AllAdvantage. It was an online advertisement company.

35 19.4 Search user experience ( 用户体验 ): It is also important to understand the users of search engines. For traditional IR systems, users were often trained in how to search and write queries. For Web search engines, users may not know or care about how to write queries. Usually, people use 2 or 3 keywords in a query, and they do not use special operators (wildcard queries, Boolean operators).

36 Search user experience ( 用户体验 ): The more people use a search engine, the more money it can earn. How can a search engine get more users? By increasing the precision of the first few results, by updating the index frequently, by having a larger index, and by offering a website that is simple, easy to use, and very fast, so that users can quickly find what they are looking for.

37 Three types of user queries: 1) Informational queries seek general information on a broad topic, e.g. how to play piano. There is no single webpage that contains all the information that the user wants. The user generally wants to combine information from several webpages.

38 Three types of user queries: 2) Navigational queries seek the website or home page of a given entity, e.g. find the webpage of Huawei ( 华为 ). The user expects the first result to be the webpage of the entity (e.g. Huawei). The user only needs one document and wants very high precision at rank 1.

39 Three types of user queries: 3) Transactional queries: the user wants to make a transaction, e.g. reserve a hotel room in Guangzhou or buy train tickets. The search engine should provide links to service providers.

40 Three types of user queries: For a given query, it can be difficult to identify the type of the query. Identifying the type of a query is useful for selecting the most relevant results and for displaying relevant advertisements (e.g. advertisements about train tickets).


45 Components of a Web search engine ( 网络爬虫 )

46 Index size: How can we compare the sizes of the indexes of two search engines (e.g. Baidu vs Bing)? This may be difficult to evaluate, for several reasons. A search engine may only index the first few thousand words of a page. A search engine may display a page in its results that is not in its index (because some other page in its index links to that page). Search engines may organize their indexes in tiers (tiered indexes). For general queries, only the main page of a website may be shown, and its other pages may not be shown.

47 Index size: Some techniques have been developed to compare the sizes of search engine indexes. Hypothesis: each search engine indexes only one part of the Web, chosen randomly. One such technique is the capture-recapture method.

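48 Capture-recapture method: Consider two search engines E1 and E2. Take a random page from the index of E1 and check whether it is in the index of E2. Repeating this gives the fraction x of pages of E1 that are also in E2. Take a random page from E2 and check whether it is in E1. This gives the fraction y of pages of E2 that are also in E1. If E1 and E2 are independent and uniform random subsets of the Web, we should have $x\,|E_1| \approx y\,|E_2|$, since both sides estimate the size of the overlap, and therefore $|E_1| / |E_2| \approx y / x$. More details are given in the book.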
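The estimate can be tried on a toy example. In the sketch below, the two "indexes" are hypothetical sets of page identifiers, and every page is tested instead of drawing a random sample, so that the result is exact. With real search engines only samples are available, and obtaining a uniform sample is itself a difficult problem.

```python
def estimate_size_ratio(pages_from_e1, index_e2, pages_from_e2, index_e1):
    """Estimate |E1| / |E2| as y / x with the capture-recapture idea."""
    # x: fraction of the pages taken from E1 that are also in E2.
    x = sum(page in index_e2 for page in pages_from_e1) / len(pages_from_e1)
    # y: fraction of the pages taken from E2 that are also in E1.
    y = sum(page in index_e1 for page in pages_from_e2) / len(pages_from_e2)
    if x == 0:
        raise ValueError("no overlap observed; the ratio cannot be estimated")
    return y / x

# Hypothetical indexes: E1 holds 600 pages, E2 holds 300, and 200 are shared.
index_e1 = set(range(0, 600))
index_e2 = set(range(400, 700))

ratio = estimate_size_ratio(sorted(index_e1), index_e2, sorted(index_e2), index_e1)
print(round(ratio, 6))  # prints: 2.0
```

Here x = 200/600 and y = 200/300, so y / x = 2, which matches the true ratio 600/300.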

49 19.6 Near-duplicates ( 近似重复 ): Another issue is that the Web may contain multiple copies of the same webpage. Up to 40% of webpages are duplicates ( 重复 ) of other pages. Some of these copies are legitimate ( 合法的 ). Others are not. Search engines try to avoid indexing duplicates to reduce the size of their indexes.

50 Detecting duplicates: How can duplicates be detected? We do not want to compare billions of webpages with each other. Simple approach: calculate a fingerprint (hash), which is a number, for each webpage. If two pages have the same fingerprint, they may be duplicates, so we need to compare them. If they are duplicates, only one of them is indexed.
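The simple approach above can be sketched as follows. The choice of SHA-256 truncated to 64 bits, the whitespace and case normalization, and the example pages are all assumptions made for illustration. This only catches copies that are identical after normalization; detecting near-duplicates requires more elaborate techniques that are not shown here.

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase, so trivial differences are ignored."""
    return " ".join(text.split()).lower()

def fingerprint(text):
    """Return a 64-bit integer fingerprint of the normalized text."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def find_duplicates(pages):
    """Map each indexed URL to the list of URLs found to duplicate it."""
    kept_by_fingerprint = {}  # fingerprint -> URLs kept for indexing
    duplicates = {}           # kept URL -> URLs skipped as duplicates
    for url, text in pages.items():
        candidates = kept_by_fingerprint.setdefault(fingerprint(text), [])
        for kept_url in candidates:
            # Equal fingerprints only mean "may be duplicates": confirm it.
            if normalize(pages[kept_url]) == normalize(text):
                duplicates.setdefault(kept_url, []).append(url)
                break
        else:
            # No real duplicate found: this page is kept and indexed.
            candidates.append(url)
    return duplicates

# Hypothetical pages: the second is a copy of the first.
pages = {
    "http://a.example/page": "Hello   World",
    "http://b.example/copy": "hello world",
    "http://c.example/other": "Something else",
}
duplicate_map = find_duplicates(pages)
print(duplicate_map)  # prints: {'http://a.example/page': ['http://b.example/copy']}
```

Pages are compared in full only when their fingerprints collide, so the expensive comparison is done for a handful of candidates rather than for every pair of pages.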

51 20 Web crawling (Web 信息发现 ): Web crawling is the process by which a search engine gathers pages from the Web to index them. Goals: collect information about webpages, collect information about links between webpages, and do this quickly!

52 Web crawler ( 网络爬虫 ): A Web crawler must have the following features ( 特征 ). 1) Robustness: some websites try to cheat and may generate an infinite number of pages to mislead Web crawlers. Web crawlers must be able to avoid these «traps» ( 陷阱 ).

53 Web crawler ( 网络爬虫 ): 2) Politeness ( 礼貌 ): a Web crawler should be polite. It should not visit a website too often. Otherwise, the owner of the website may not be happy. 3) Efficiency: the Web crawler should be able to efficiently index a huge number of webpages.

54 Web crawler ( 网络爬虫 ): 4) Quality: the Web crawler should try to index the highest-quality or most useful webpages first. It must be able to assign different priority levels to different webpages. 5) Extensibility: a Web crawler should work with different technologies, different languages, different data formats, etc.

55 Crawling: How does a Web crawler index websites? The crawler begins with one or more URLs (webpage addresses). The crawler visits one of these webpages. The crawler extracts the text and the links. The text is indexed. The links are used to find more webpages. The crawler then continues by visiting other webpages.
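The crawling loop described above can be sketched as a breadth-first traversal. To keep the example self-contained, it crawls a small in-memory "Web" instead of downloading real pages. The `TOY_WEB` data, the `crawl` function, and its `fetch` parameter are hypothetical names introduced for this illustration.

```python
from collections import deque

# Hypothetical in-memory Web: URL -> (text of the page, links found in it).
TOY_WEB = {
    "http://a.example/": ("home page", ["http://a.example/news", "http://b.example/"]),
    "http://a.example/news": ("news page", ["http://a.example/"]),
    "http://b.example/": ("another site", ["http://a.example/news", "http://c.example/missing"]),
}

def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages starting from the seeds and return {url: text}."""
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, to avoid repeat visits
    index = {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:         # the page could not be retrieved
            continue
        text, links = page
        index[url] = text        # "index" the text (here we simply store it)
        for link in links:       # use the links to discover more pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

crawled = crawl(["http://a.example/"], TOY_WEB.get)
print(list(crawled))
# prints: ['http://a.example/', 'http://a.example/news', 'http://b.example/']
```

The `seen` set is what prevents the crawler from visiting the same webpage twice, and `max_pages` is a crude guard against sites that generate an endless supply of pages.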

56 Crawling: A Web crawler should not visit the same webpage twice. How fast must a crawler be to crawl the Web? Crawling 4 billion webpages in 1 month requires about 1,540 webpages per second! A Web crawler may be designed to visit popular websites more often than less popular websites.
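The crawling rate can be checked with a short calculation, assuming a month of 30 days (an assumption, since the slide does not say how the month is counted).

```python
pages_to_crawl = 4_000_000_000
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month

pages_per_second = pages_to_crawl / seconds_per_month
print(round(pages_per_second))  # prints: 1543
```

This agrees with the rounded figure of about 1,540 webpages per second.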

57 Robot exclusion: Some people do not want Web crawlers to index their website. To prevent this, one can put a file named robots.txt on the website to tell Web crawlers to ignore the website. (Figure: an example robots.txt file, with a label pointing to the name of a search engine.)
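A polite crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt file. The crawler names `ExampleBot` and `OtherBot` and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ExampleBot is excluded from the whole website,
# and every other crawler is excluded only from the /private/ directory.
robots_txt = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked_bot = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
other_public = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
other_private = parser.can_fetch("OtherBot", "http://www.example.com/private/a.html")
print(blocked_bot, other_public, other_private)  # prints: False True False
```

The `User-agent` line is where the name of a search engine's crawler appears, so a website can give different instructions to different crawlers. Following robots.txt is voluntary: it tells crawlers what the owner wants but does not technically block access.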

58 Crawling: Generally, a search engine will have many computers working as Web crawlers. These Web crawlers may be located in different places: China, Europe, America, etc. These Web crawlers must work together. They must split the work and avoid visiting the same websites multiple times. This can be challenging!
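One simple way to split the work, shown here only as a sketch, is to assign each website to a crawler by hashing its host name. The `assign_crawler` function, the choice of SHA-256, and the URLs are assumptions for illustration. Real systems may also take geography and load into account.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Return the number (0 .. num_crawlers - 1) of the crawler responsible for url."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# Hypothetical URLs: the first two are on the same website.
first = assign_crawler("http://www.example.com/index.html", 4)
second = assign_crawler("http://www.example.com/news/today.html", 4)
third = assign_crawler("http://another.example.org/", 4)
print(first == second)  # prints: True
```

Because all pages of a website go to the same crawler, no two crawlers visit the same website, and that single crawler can also limit how often it visits the site, which helps with politeness.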

59 Distributed index: For a Web search engine, the index may be very large. Moreover, many users may want to access the index at the same time. Thus, the index is stored on several computers.

60 Link analysis: Many search engines consider the links between websites to be important information for ranking webpages. Link analysis means analyzing the links between websites to derive useful information. A link from a website A to another website B is considered an endorsement ( 认可 ) of website B by A. (Figure: page A linking to page B.)

61 Link analysis: When analyzing links, we can also analyze the context of each link in a webpage (the text of the link), e.g. «The real-estate market in Shenzhen». This is useful because webpage B may not provide an accurate description of itself. (Figure: page A linking to page B.)

62 Link analysis: In fact, there is often a gap between the terms in a webpage and how Web users would describe the page. The text used in a link is therefore useful, but some of its terms may not be, e.g. «Click here for information about Shenzhen». We can use the TF-IDF measure to filter out unimportant words. (Figure: page A linking to page B.)
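The filtering idea can be illustrated with a small sketch. It uses one common TF-IDF variant (term count multiplied by log(N / document frequency)), and the collection of link texts is made up, so the exact weights are illustrative only.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words without punctuation."""
    words = (word.strip(".,!?").lower() for word in text.split())
    return [word for word in words if word]

def idf_table(link_texts):
    """Compute log(N / df) for every term that appears in the link texts."""
    n = len(link_texts)
    document_frequency = Counter()
    for text in link_texts:
        document_frequency.update(set(tokenize(text)))
    return {term: math.log(n / df) for term, df in document_frequency.items()}

def tf_idf_weights(link_text, idf):
    """Weight each term of one link text by its count times its IDF."""
    counts = Counter(tokenize(link_text))
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

# Hypothetical link texts collected from many webpages.
link_texts = [
    "Click here for information about Shenzhen.",
    "Click here for cheap flights",
    "Click here for information about hotels",
    "Click here to read more",
]
idf = idf_table(link_texts)
weights = tf_idf_weights(link_texts[0], idf)
top_term = max(weights, key=weights.get)
print(top_term, weights["click"])  # prints: shenzhen 0.0
```

Words such as «click» and «here» occur in every link text, so their weight drops to zero, while «Shenzhen» keeps a high weight and remains as a description of the linked page.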

63 Link analysis: Thanks to the analysis of the text of links, if we search for «big blue», we may find the webpage of IBM. This is great, but there can be some side effects. For example, if we search for «miserable failure», we can find the page of George W. Bush.

64 This is because many people have purposely linked to the page of George W. Bush with the text «miserable failure» to fool the search engines.

65 Another example

66 Link analysis: Search engines use various techniques to try to avoid this problem. Some search engines consider not only the text of links, but also the text before and after a link.

67 FINAL EXAM


69 Some questions

70 Some questions

71 Some questions

72 Conclusion: Today we covered Web search engines. Good luck with your preparation for the final exam! Goodbye!

73 References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press,

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007, Philippe Fournier-Viger, Full professor, School of Natural Sciences and Humanities, philfv8@yahoo.com, Spring 2019