信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

1 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities Spring

2 Last week We discussed: evaluation in an information retrieval system. Today: Web search engines, solution of the second assignment, about the final exam. 2

3 Course schedule ( 日程安排 ) Lecture 1: Introduction (Chapter 1), Boolean retrieval; Lecture 2: Term vocabulary and posting lists (Chapter 2); Lecture 3: Dictionaries and tolerant retrieval (Chapter 3); Lecture 4: Index construction (Chapter 4); Lecture 5: Scoring, term weighting, the vector space model (Chapter 6); Lecture 6: A complete search system (Chapter 7); Lecture 7: Evaluation in information retrieval; Lecture 8: Web search engines, advanced topics, conclusion; Final exam 3

4 WEB SEARCH ENGINES 4

5 19.1 The Web What is special about the Web? The number of documents (very large) Lack of coordination in the creation of the documents, Diversity of background and motives of participants. 5

6 The Web The Web is a set of webpages ( 网页 ). Webpages are created using a language called HTML. 6

8 The Web Browser Webpages are stored on servers ( 服务器 ). To access a webpage, one must use software called a Web browser ( 浏览器 ). Webpages are sent over the Internet using the HTTP protocol (HTTP 协议 ). (Figure labels: Internet, server of HITSZ, Home.) 8

9 The Web The idea of the Web: each webpage contains links to other webpages (hyperlinks - 超链接 ). Each webpage has an address (URL). Creating a webpage is not difficult. Webpages have become one of the best ways to supply and consume information. 9

10 The Web Billions of webpages containing information. But if we cannot search this information, it is useless. Historically, two ways of searching for information: Search engines (Baidu, Bing, etc.) Directories (Yahoo!, etc.) 10

11 Web directories ( 网络目录 ) Web directory: a list of websites, separated by categories. 11

12 Web directories ( 网络目录 ) A Web directory contains only the best webpages for each category. Problems: Web directories are created by humans. This takes a lot of time. It is not convenient for searching. A user needs to know how to find information within the categories. There can be thousands of categories. Information in categories is often old. For these reasons, Web directories have mostly disappeared. 12

13 Web search engines Baidu, Bing, etc. They adapt information retrieval techniques to search billions of documents. Adapted in terms of: Indexing, Processing queries, Ranking documents 13

14 Web search engines Why are they popular? Ability to quickly answer queries. Ability to index millions of documents. Almost always up to date. Fifteen years ago, results returned by Web search engines were not very good. Novel ranking techniques ( 排序技术 ) and spam-fighting techniques ( 反垃圾邮件技术 ) have since been proposed to obtain better results. 14

15 19.2 Web characteristics The Web is mainly decentralized ( 分散 ). Many languages. Many different types of content. Some webpages contain only pictures and no text. The Web contains a lot of unreliable information. How can a search engine know which websites can be trusted? 15

16 Size of the Web 1995: 30 million webpages indexed by AltaVista. 2017: 4.48 billion webpages. Note: only static webpages are counted. Dynamic webpage: a page whose content is generated in real time for the user. 16

17 The Web graph The Web can be viewed as a graph ( 图 ). Each webpage is a vertex ( 顶点 ). A link between two webpages is an edge ( 图的边 ). The Web is a directed graph ( 有向图 ). Webpages: A, B, C, ..., F 17

18 The Web graph Two types of links: In-links: links that go to a page. Out-links: links that leave a page. Node B has 3 in-links and 1 out-link. 18
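
The in-link and out-link counts can be computed directly from a list of directed edges. The sketch below is only an illustration: the edge list is hypothetical (the slide's figure is not reproduced in this transcript) and was chosen so that page B has 3 in-links and 1 out-link, as in the slide.

```python
from collections import defaultdict

# Hypothetical directed edges (source page, target page); not the slide's actual figure.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E"), ("E", "F")]

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages that link to it
for source, target in edges:
    out_links[source].add(target)
    in_links[target].add(source)

def degree_summary(page):
    """Return (number of in-links, number of out-links) for a page."""
    return len(in_links[page]), len(out_links[page])

print(degree_summary("B"))  # (3, 1)
```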

19 The Web graph Not all web pages are equally popular. Many web pages have few in-links. Few web pages have many in-links. The number of in-links per website follows a power law distribution ( 幂律分布 ). (Figure: number of webpages plotted against number of in-links.) 19

20 Spam For some queries, there is intense competition to appear high in the results of search engines, e.g. «Beijing real-estate» ( 房地产 ). Thus, many people modify their website to try to appear first in the search results, e.g. by writing «Beijing real-estate» many times in a webpage to increase the term frequency, or by writing invisible text using the background color of the webpage (e.g. white). 20
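
To see why repeating a keyword inflates the term frequency, here is a small sketch. The two page texts are invented for illustration, and the relative-frequency definition used here is just one simple choice, not necessarily what any search engine computes.

```python
from collections import Counter

def term_frequency(text, term):
    """Fraction of the words in `text` that are equal to `term` (case-insensitive)."""
    words = text.lower().split()
    return Counter(words)[term.lower()] / len(words) if words else 0.0

# Hypothetical page contents.
honest_page = "we sell apartments in beijing and offer advice on real-estate loans"
stuffed_page = "real-estate real-estate real-estate beijing real-estate real-estate"

print(term_frequency(honest_page, "real-estate"))   # 1 word out of 11
print(term_frequency(stuffed_page, "real-estate"))  # 5 words out of 6
```

An unusually high frequency for a single term is one signal that a spam detector could look at.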

21 Spam detection Nowadays, search engines use many sophisticated methods to detect spam (repeated keywords, etc.). Websites that are trying to cheat may be blocked from search engines. Thus, some people have developed new techniques to cheat search engines ( 欺骗搜索引擎 ) 21

22 Cloaking ( 伪装 ) One such technique is cloaking. Some websites try to cheat by showing different content to search engines and users. This is a problem that did not exist in traditional IR. 22

23 Paid inclusion Paid inclusion: a website can also pay a search engine to appear high in the results. Some search engines do not allow paid inclusion. 23

25 Doorway page Doorway page: a page containing carefully chosen text so that it ranks highly in search engines for some keywords. The page then links to another page containing commercial content. A website may have many doorway pages. (Figure: two doorway pages linking to another webpage.) 25

26 Link analysis To reduce the problem of spam on the Web, many search engines perform link analysis. Basic idea: rank a page higher, or treat it as more reliable, if it has many in-links. e.g. the PageRank algorithm 26
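
As a rough illustration of the idea behind PageRank, the sketch below runs a simplified power iteration on a tiny graph. The graph, the damping factor of 0.85 and the number of iterations are assumptions made for this example; production search engines use far more elaborate variants.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Simplified power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its score evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: B receives links from A, C and D, and links back to A.
graph = {"A": ["B"], "C": ["B"], "D": ["B"], "B": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # B
```

The page with the most in-links (B) ends up with the highest score, and A benefits from being linked to by B.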

27 Link analysis But some people create fake links to increase the popularity of their webpages. There is thus a continuing battle between spammers and search engines. 27

28 19.3 Advertising ( 广告 ) Two main advertisement models: 1) cost per view: The goal is to show some content to the user (branding). An image is typically used. A company may pay to display the image 1000 times. 28

29 Advertising ( 广告 ) Two main advertisement models: 2) cost per click: The goal is to get people to click on an advertisement and visit the website of the advertiser (to initiate a transaction). The website may ask the person to buy something. An image or text may be used with a link. A company may pay for 1000 clicks. 29

30 19.3 Advertising ( 广告 ) Today, many search engines earn money from advertising. Some display search results and advertisements separately (as search results and sponsored search results). Other search engines combine search results and advertisements. 30

31 (Figure: search results and sponsored search results.) 31

32 Online advertisement networks There are many advertisement networks: Bing Ads: provides pay-per-click advertisements for Bing and Yahoo. AdWords: sells advertisements on various websites. 32

33 Click-spam Click spam: a company clicks on the advertisements of its competitors to waste their advertising money. This may be done using automated software. A search engine must use techniques to block click spam. 33

34 Example: AllAdvantage. It was an online advertising company. 34

35 19.4 Search user experience ( 用户体验 ) It is also important to understand the users of search engines. For traditional IR systems: users often received training on how to search and write queries. For Web search engines: users may not know or care about how to write queries. Usually, people use 2 or 3 keywords in a query. Usually, people do not use special operators (wildcard queries, Boolean operators, etc.). 35

36 Search user experience ( 用户体验 ) The more people use a search engine, the more money it can earn. How can a search engine get more users? By increasing the precision of the first few results, by updating the index frequently, by having a larger index, and by offering a website that is simple, easy to use and very fast, so that a user can quickly find what they are looking for. 36

37 Three types of user queries 1) Informational queries: seek general information on a broad topic, e.g. «how to play piano». There is not a single webpage that contains all the information that the user wants. The user generally wants to combine information from several webpages. 37

38 Three types of user queries 2) Navigational queries: seek the website or home page of a given entity, e.g. find the webpage of Huawei ( 华为 ). The user expects the first result to be the webpage of the entity (e.g. Huawei). The user only needs one document and wants very high precision (1). 38

39 Three types of user queries 3) Transactional queries: the user wants to make a transaction, e.g. reserve a hotel room in Guangzhou or buy train tickets. The search engine should provide links to service providers. 39

40 Three types of user queries For a given query, it can be difficult to identify the type of the query. Identifying the type of a query is useful for selecting the most relevant results and for displaying relevant advertisements (e.g. advertisements about train tickets). 40

45 Components of a Web search engine ( 网络爬虫 ) 45

46 Index size How can we compare the sizes of the indexes of two search engines (e.g. Baidu vs Bing)? This may be difficult to evaluate: A search engine may index only the first few thousand words of a page. A search engine may display a page in its results that is not in its index (because some other page in its index links to that page). Search engines may organize their indexes in tiers (tiered indexes). For general queries, only the main page of a website may be shown, and its other pages may not be shown. 46

47 Index size Some techniques have been developed to compare the sizes of search engine indexes. Hypothesis: each search engine indexes only one part of the Web, chosen randomly. The capture-recapture method 47

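48 Capture-recapture method Two search engines E1 and E2. Take random pages from E1 and check whether each is in E2. This gives a ratio x (the fraction of E1's pages that are in E2). Take random pages from E2 and check whether each is in E1. This gives a ratio y (the fraction of E2's pages that are in E1). If E1 and E2 are independent and uniform random subsets of the Web, we should have x|E1| ≈ y|E2|, and therefore |E1| / |E2| ≈ y / x. More details in the book. 48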
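
The capture-recapture estimate can be checked with a small simulation. Both sampled fractions estimate the same overlap, so x|E1| ≈ y|E2| and the size ratio |E1| / |E2| is approximately y / x. In the sketch below, the "Web" is just a range of integers, and the index sizes (40,000 and 20,000 pages), the sample size and the random seed are arbitrary assumptions for illustration.

```python
import random

def estimate_size_ratio(e1, e2, sample_size, rng):
    """Estimate |E1| / |E2| as y / x from random samples of each index."""
    sample1 = rng.sample(sorted(e1), sample_size)
    sample2 = rng.sample(sorted(e2), sample_size)
    x = sum(page in e2 for page in sample1) / sample_size  # fraction of E1 pages found in E2
    y = sum(page in e1 for page in sample2) / sample_size  # fraction of E2 pages found in E1
    return y / x

rng = random.Random(0)
web = range(100_000)                # toy Web: page ids 0..99999
e1 = set(rng.sample(web, 40_000))   # index of engine E1
e2 = set(rng.sample(web, 20_000))   # index of engine E2
ratio = estimate_size_ratio(e1, e2, 5_000, rng)
print(ratio)  # should be close to the true ratio 40000 / 20000 = 2
```

In practice, drawing a truly random page from a search engine's index is itself difficult, so real measurements are noisier than this simulation suggests.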

49 19.6 Near-duplicates ( 近似重复 ) Another issue: the Web may contain multiple copies of the same webpage. Up to 40% of webpages are duplicates ( 重复 ) of other pages. Some of these copies are legitimate ( 合法的 ). Others are not. Search engines try to avoid indexing duplicates to reduce the size of their indexes. 49

50 Detecting duplicates How to detect duplicates? We do not want to compare billions of webpages with each other. Simple approach: calculate a fingerprint (a hash, which is a number) for each webpage. If two pages have the same fingerprint, they may be duplicates, so we need to compare them. If they are duplicates, only one of them is indexed. 50
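
The fingerprint approach can be sketched as follows. The choice of SHA-256 truncated to 64 bits is an assumption for this example (the slides do not name a hash function), and the URLs and page texts are made up.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Return a 64-bit integer fingerprint of the page text (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def find_duplicates(pages):
    """Group URLs whose text is identical, comparing only pages that share a fingerprint."""
    buckets = defaultdict(list)
    for url, text in pages.items():
        buckets[fingerprint(text)].append(url)
    groups = []
    for urls in buckets.values():
        # Same fingerprint only means "possibly duplicates": confirm by comparing the text.
        by_text = defaultdict(list)
        for url in urls:
            by_text[pages[url]].append(url)
        groups.extend(g for g in by_text.values() if len(g) > 1)
    return groups

# Hypothetical pages.
pages = {
    "http://a.example/1": "Shenzhen real-estate news",
    "http://b.example/copy": "Shenzhen real-estate news",
    "http://c.example/other": "How to play piano",
}
print(find_duplicates(pages))
# [['http://a.example/1', 'http://b.example/copy']]
```

This only catches exact copies; pages that differ slightly (near-duplicates) get different fingerprints and need other techniques.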

51 20 Web crawling (Web 信息发现 ) Web crawling: the process by which a search engine gathers pages from the Web to index them. Goal: Collect information about webpages, Collect information about links between webpages, Do this quickly! 51

52 Web crawler ( 网络爬虫 ) A web crawler must have the following features ( 特征 ): 1) Robustness: Several websites try to cheat and may try to generate an infinite number of pages to mislead web crawlers. Web-crawlers must be able to avoid these «traps» ( 陷阱 ). 52

53 Web crawler ( 网络爬虫 ) 2) Politeness ( 礼貌 ): A Web crawler should be polite. It should not visit a website too often. Otherwise, the owner of the website may not be happy. 3) Efficiency: The Web crawler should be able to efficiently index a huge number of webpages. 53

54 Web crawler ( 网络爬虫 ) 4) Quality: The Web crawler should try to index the high-quality or most useful webpages first. The Web crawler must be able to assign different priority levels to different webpages. 5) Extensibility: A Web crawler should work with different technologies, different languages, different data formats, etc. 54

55 Crawling How does a Web crawler index websites? The crawler begins with one or more URLs (web page addresses). The crawler visits one of these webpages. The crawler extracts the text and links. The text is indexed. The links are used to find more webpages. The crawler then continues visiting other webpages. 55
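
The crawling loop described above can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "Web" here is a hypothetical in-memory dictionary (`FAKE_WEB`), and "indexing" is reduced to storing the page text; a real crawler would fetch pages over HTTP and parse their HTML.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, list of out-link URLs).
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/about", "http://b.example/"]),
    "http://a.example/about": ("about us", ["http://a.example/"]),
    "http://b.example/": ("another site", ["http://a.example/about"]),
}

def crawl(seed_urls, fetch):
    """Visit pages breadth-first from the seeds; return {url: text} for every page reached."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, so no page is visited twice
    index = {}
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:          # unreachable or unknown page
            continue
        text, links = page
        index[url] = text         # stand-in for real indexing
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a.example/"], FAKE_WEB.get)
print(list(index))
# ['http://a.example/', 'http://a.example/about', 'http://b.example/']
```

The `seen` set is what prevents the crawler from visiting the same webpage twice, even though the pages link to each other in a cycle.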

56 Crawling A Web crawler should not visit the same webpage twice. How fast must a crawler be to crawl the Web? Crawling 4 billion webpages in 1 month requires about 1540 webpages per second! A Web crawler may be designed to visit popular websites more often than less popular websites. 56
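
The crawl rate quoted above can be checked with a quick calculation, assuming a 30-day month (the slide does not say which month length was used, which explains the small difference from the rounded figure of 1540):

```python
pages = 4_000_000_000
seconds_per_month = 30 * 24 * 60 * 60  # assuming a 30-day month
rate = pages / seconds_per_month
print(round(rate))  # 1543 webpages per second
```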

57 Robot exclusion Some people do not want Web crawlers to index their website. To do this, one can put a file named robots.txt on the website to tell Web crawlers to ignore it. (Figure: a robots.txt file, with a label pointing to the name of a search engine.) 57
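
As an illustration, Python's standard library includes a robots.txt parser (`urllib.robotparser`). The robots.txt content below is hypothetical, as are the crawler names `ExampleBot` and `OtherBot`: it blocks one crawler from the whole site and blocks the `/private/` directory for all other crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one crawler entirely, and /private/ for all others.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))    # False
print(parser.can_fetch("OtherBot", "http://www.example.com/index.html"))      # True
print(parser.can_fetch("OtherBot", "http://www.example.com/private/a.html"))  # False
```

Note that robots.txt is only a request: a polite crawler checks it before fetching a page, but nothing technically forces a crawler to obey it.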

58 Crawling Generally, a search engine will have many computers working as Web crawlers. These Web crawlers could be located in different locations: China, Europe, America, etc. These Web crawlers must work together. They must split the work and avoid visiting the same websites multiple times. This can be challenging! 58
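
One simple way to split the work among several crawlers is to assign each website to a crawler according to a hash of its host name, so that two crawlers never visit the same site. This is only a sketch of one possible scheme (the slides do not describe a specific one); the function name `assign_crawler`, the URLs and the number of crawlers are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler id by hashing its host name, so one host always goes to one crawler."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# Hypothetical URLs.
urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://news.example.org/today",
]
for url in urls:
    print(assign_crawler(url, 4), url)
```

Because the first two URLs share a host, they are always handled by the same crawler, which also makes it easier for that crawler to stay polite towards the site.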

59 Distributed index For a Web search engine, the index may be very large. Moreover, many users may want to access the index at the same time. Thus an index will be stored on several computers. 59
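
A minimal sketch of an index split over several machines, assuming the documents are partitioned by document id (one common option; the slides do not say which partitioning is used). Each "machine" is simulated by a separate dictionary, and the class name `ShardedIndex` and the documents are invented for the example.

```python
from collections import defaultdict

class ShardedIndex:
    """Toy inverted index partitioned by document id across several shards."""

    def __init__(self, num_shards):
        # One inverted index (term -> set of document ids) per shard.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, doc_id, text):
        """Index a document on the shard chosen by its id."""
        shard = self.shards[doc_id % len(self.shards)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term):
        """Ask every shard for the term and merge the answers."""
        results = set()
        for shard in self.shards:
            results |= shard.get(term.lower(), set())
        return sorted(results)

index = ShardedIndex(num_shards=3)
index.add(1, "Shenzhen real-estate market")
index.add(2, "How to play piano")
index.add(5, "Hotels in Shenzhen")
print(index.search("shenzhen"))  # [1, 5]
```

With this layout a query must be sent to every shard, but each shard only holds a fraction of the documents and the shards can answer in parallel.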

60 Link analysis Many search engines consider the links between websites as important information for ranking webpages. Link analysis: analyzing the links between websites to derive useful information. A link from a website A to another website B is considered an endorsement ( 认可 ) of website B by A. (Figure: A links to B.) 60

61 Link analysis When analyzing links, we can also analyze the context of each link in a webpage (the text of the link), e.g. «The real-estate market in Shenzhen». This is useful because webpage B may not provide an accurate description of itself. (Figure: A links to B.) 61

62 Link analysis In fact, there is often a gap between the terms in a webpage and how web users would describe the page. The text used in a link is useful. But some terms may not be useful, e.g. «Click here for information about Shenzhen». We can use the TF-IDF measure to filter out unimportant words. (Figure: A links to B.) 62
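
The filtering of anchor-text terms with TF-IDF can be sketched on a toy collection. The anchor texts and the threshold of 1.0 below are assumptions chosen for illustration; idf is computed here as log(N / df), one common definition.

```python
import math

# Hypothetical anchor texts of links found on different pages.
anchor_texts = [
    "click here for information about shenzhen",
    "click here for information about hotels",
    "click here for information about train tickets",
    "click here for information about piano lessons",
    "huawei home page",
    "shenzhen real-estate market",
]

def idf(term, documents):
    """Inverse document frequency: log(N / df), df = number of documents containing the term."""
    df = sum(term in doc.split() for doc in documents)
    return math.log(len(documents) / df) if df else 0.0

def tf_idf(term, document, documents):
    """Term frequency in the document multiplied by the inverse document frequency."""
    return document.split().count(term) * idf(term, documents)

first = anchor_texts[0]
weights = {t: tf_idf(t, first, anchor_texts) for t in first.split()}
THRESHOLD = 1.0  # arbitrary cut-off chosen for this toy data
useful = [t for t in first.split() if weights[t] > THRESHOLD]
print(useful)  # ['shenzhen']
```

Words such as «click» and «here» appear in most anchor texts, so their weight is low and they are filtered out, while «shenzhen» is kept.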

63 Link analysis Thanks to the analysis of the text of links: If we search «big blue», we may find the webpage of IBM. This is great. But there can be some side-effects. For example, if we search «miserable failure» we can find the page of George W. Bush. 63

64 This is because many people have purposely linked to the page of George W. Bush with the text «miserable failure» to fool the search engines. 64

65 Another example 65

66 Link analysis Search engines try to use various techniques to avoid this problem. Some search engines will not only consider the text of links, but also the text before and after a link. 66

67 FINAL EXAM 67

69 Some questions 69

72 Conclusion Today: Web search engines. I wish you good preparation for the final exam! Goodbye! 72

73 References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press,


More information
