Information Retrieval, May 15: Web Retrieval

What's so special about the Web? The Web is large, changing fast, and public: there is no control over editing or contents, and it is full of spam and advertisements.

How big is the Web? Practically infinite, due to dynamically generated pages. The host count is more than 1 billion (1,010,251,829) computers on the Internet (Internet Domain Survey, January 2014). The indexed Web contains at least 4.62 billion pages (worldwidewebsize.com, October 2014). Due to the growth rate, any estimate is immediately out of date.

How big is the Web? "The web is huge; nobody knows how big it is, but what we do know is that the part of it that is reached and indexed by search engines is just the surface. Most of the web is buried deep down in dynamically generated web pages, pages that are not linked to by other pages, and sites that require logins, which are not reached by these engines. Most experts think that this deep (hidden) web is several orders of magnitude larger than the 2.3 billion pages that we can see." (John Naughton, The Observer, Sunday 9 March 2014)

Web search engines: Google, Bing, Yahoo!, Baidu (Chinese), Yandex (Russian), DuckDuckGo (same results for all users).

Challenges: Volume and distribution of data, and the pace of change: how to find pages to index. Quality and authoritativeness of documents: how do you know that you can rely on what you find? Expressing queries and interpreting results: how to formulate queries? Most users have no education in search. Interpreting queries and ranking results fast: efficient search based on poorly formulated, ambiguous queries in a very large repository.

Variety of content: the Web is public, so anyone can publish. Many formats: HTML, GIF, JPEG, ASCII text, and PDF. Many languages on the Web. Quality varies widely.

Conversion: text is stored in hundreds of incompatible file formats, e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF, PowerPoint, Excel. A conversion tool converts the document content into a tagged text format such as HTML or XML, which retains some of the important formatting information.

Web page spam. Link spam: artificially increasing the link-based scores of Web pages. Click spam: done by robots that issue queries and click on preselected pages or ads. Term spam: artificially increasing term-frequency-based scores.

Search Engine Optimization. People often confuse Web spam with Search Engine Optimization (SEO). SEO consists of techniques to improve the description of the contents of a Web page; proper and better descriptions improve the odds that the page will be ranked higher. These are legitimate techniques, particularly when they follow the guidelines published by most search engines. In contrast, malicious SEO is used by Web spammers who want to deceive users and search engines alike.

Advertisement. Advertising is the search engines' main source of revenue: contextual advertising, sponsored search, content match, keyword bids.

Web search vs. classical IR: differences. Different content production: anyone can produce Web pages. Mass and heterogeneity of the content. Different users: many non-professionals! Varying types of search goals: informational, navigational, and transactional queries.

The Web graph is a directed graph: pages are nodes, links are edges. It is not strongly connected. Pages have in-links and out-links; the average number of in-links is 8-15, and links are not randomly distributed.

Crawling: finding and downloading Web pages automatically, done by a crawler or spider. Variants: Web, topical/focused, enterprise. Challenges: volume and pace of change, no control over the pages that are to be copied, the deep Web, politeness and privacy. Two tasks: downloading pages and finding URLs.

Web Crawler: starts with a set of seeds, which are a set of URLs given to it as parameters. Seeds are added to a URL request queue. The crawler starts fetching pages from the request queue. Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch. New URLs are added to the crawler's request queue, or frontier. This continues until there are no more new URLs or the disk is full.
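A minimal sketch of this loop in Python, only to illustrate the frontier idea; the names (LinkExtractor, crawl, max_pages) are illustrative and not part of the course material, and a real crawler would add politeness, robots.txt checks, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)               # URL request queue
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                      # unreachable page: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)          # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)          # grow the frontier
    return pages
```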

Crawling the Web

Web Crawling: web crawlers spend a lot of time waiting for responses to requests. To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once. Crawlers could potentially flood sites with requests for pages; to avoid this problem, web crawlers use politeness policies, e.g., a delay between requests to the same web server.

Politeness policies: to avoid taking up all the resources of a web server, fetch only one page at a time from a server and delay between requests to the same server. Request queues are split into one queue per web server, and most queues are off limits at any one time, so a very large queue is required. Web sites can permit or disallow crawling of the site or parts of it.
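One simple way to realize the per-server delay, sketched here with an assumed fixed delay of a few seconds per host (real crawlers tune the delay and combine it with the per-host queues described above):

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Tracks the last request time per host and enforces a minimum delay."""
    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.last_request = {}            # host -> timestamp of last fetch

    def wait_for(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        ready_at = self.last_request.get(host, 0.0) + self.delay
        if now < ready_at:
            time.sleep(ready_at - now)    # be polite: back off before re-hitting the host
        self.last_request[host] = time.monotonic()
```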

Controlling Crawling: even crawling a site slowly will anger some web server administrators, who object to any copying of their data. A robots.txt file can be used to control crawlers.
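A sketch of honoring robots.txt with the standard library's robot parser; the user-agent string here is a made-up placeholder.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="IRCourseCrawler"):
    """Check the site's robots.txt before fetching a page."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()                 # downloads and parses robots.txt
    except OSError:
        return True               # robots.txt unreachable: treat as allowed
    return rp.can_fetch(user_agent, url)
```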

Simple Crawler Thread

Focused Crawling: attempts to download only those pages that are about a particular topic; used by vertical search applications. Pages about a topic tend to have links to other pages on the same topic, and popular pages for a topic are typically used as seeds. The crawler uses a text classifier to decide whether a page is on topic.
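The classifier itself is whatever the application supplies; the sketch below simply assumes hypothetical helpers is_on_topic(text) and extract_links(url, html) and shows where the topic decision enters the crawl loop.

```python
def focused_crawl_step(url, html, frontier, seen, is_on_topic, extract_links):
    """Expand the frontier only from pages the classifier judges on-topic.

    is_on_topic and extract_links are assumed helpers: a text classifier
    and a link extractor such as the one sketched earlier.
    """
    if not is_on_topic(html):
        return                    # off-topic page: do not follow its links
    for link in extract_links(url, html):
        if link not in seen:
            seen.add(link)
            frontier.append(link)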

Deep Web: sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web; it is much larger than the conventional Web. Three broad categories: private sites (no incoming links, or may require log-in with a valid account), form results (sites that can be reached only after entering some data into a form), and scripted pages (pages that use JavaScript, Flash, or another client-side language to generate links).

Distributed Crawling: three reasons to use multiple computers for crawling: it helps to put the crawler closer to the sites it crawls, reduces the number of sites the crawler has to remember, and reduces the computing resources required.

Storing the Documents. Reasons to store converted document text: it saves crawling time when a page has not been updated, and it provides efficient access to text for snippet generation, information extraction, etc. Store many documents in large files rather than each document in its own file: this avoids overhead in opening and closing files and reduces seek time relative to read time. Compound document formats are used to store multiple documents in a file, e.g., TREC Web.

Conversion and Storage. The collected documents are rarely plain text (HTML, XML, PDF, Office, RTF, txt); they need to be converted to uniform text plus metadata, with consistent character encoding. The document data store holds text plus structured data; it is needed for fast access (snippets), information extraction, and saving processing cost and network load. Snippets are unique to each query and created dynamically.

TREC Web Format

Indexes: inverted indexes, distributed because of their size, costs, and the need for efficient query processing. Hierarchical: a small first-level index for the most common queries, and a larger, slower index for the rest of the queries. Dynamic: merging indexes or merging results.
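A toy illustration of the underlying structure; a real engine partitions this index across machines and compresses the posting lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns term -> sorted list of (doc_id, term_freq)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

# build_inverted_index({1: "web retrieval", 2: "web crawling"})
# -> {'web': [(1, 1), (2, 1)], 'retrieval': [(1, 1)], 'crawling': [(2, 1)]}
```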

Freshness: Web pages are constantly being added, deleted, and modified. A web crawler must continually revisit pages it has already crawled to see if they have changed, in order to maintain the freshness of the document collection; stale copies no longer reflect the real contents of the web pages.

Freshness: the HTTP protocol has a special request type called HEAD that makes it easy to check for page changes; it returns information about the page, not the page itself.
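A small sketch of such a check with the standard library, reading only the Last-Modified header (if the server provides one):

```python
from urllib.request import Request, urlopen

def last_modified(url):
    """Issue an HTTP HEAD request and return the Last-Modified header, if any."""
    response = urlopen(Request(url, method="HEAD"), timeout=5)
    return response.headers.get("Last-Modified")   # header info only, no page body
```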

Freshness: it is not possible to constantly check all pages; the crawler must check important pages and pages that change frequently. Freshness is the proportion of pages that are fresh. Optimizing for this metric can lead to bad decisions, such as not crawling popular sites. Age is a better metric.

Age: the expected age of a page t days after it was last crawled is given by the formula below. Web page updates follow a Poisson distribution: on average, the time until the next update is governed by an exponential distribution.
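The formula on the original slide is not reproduced in this transcription; the standard formulation (as in Croft, Metzler and Strohman's textbook), with λ the page's average rate of change per day, is:

```latex
\text{Age}(\lambda, t) = \int_{0}^{t} P(\text{page changed at time } x)\,(t - x)\,dx
                       = \int_{0}^{t} \lambda e^{-\lambda x}\,(t - x)\,dx
```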

Freshness vs. Age

Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency. They are generated by web server administrators. A sitemap tells the crawler about pages it might not otherwise find, and gives the crawler a hint about when to check a page for changes.

Sitemap Example
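The example on the slide is not included in the transcription; a minimal sitemap in the style of the sitemaps.org protocol, parsed with the standard library, might look like this (the URL and values are made up):

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2014-10-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", ns):
    # A crawler would feed loc into the frontier and use lastmod/changefreq
    # as hints for when to revisit the page.
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", namespaces=ns),
          url.findtext("sm:changefreq", namespaces=ns))
```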

Removing Duplicates and Noise. Duplicate and near-duplicate documents occur in many situations: copies, versions, plagiarism, spam, mirror sites. About 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%. Duplicates consume significant resources during crawling, indexing, and search, and have little value to most users. Noise: text, links, and pictures that are not related to the central content of the document; it has a negative effect on ranking.

Finding Content Blocks: plot the cumulative distribution of tags in the example web page; the main text content of the page corresponds to the plateau in the middle of the distribution.
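A rough sketch of the idea behind this plateau (count tag vs. text tokens and pick the span with few tags inside and many tags outside); the tokenizer is deliberately crude and the quadratic search is for illustration only.

```python
import re

def main_content_span(html):
    """Pick the token span that maximizes text tokens inside plus tag tokens outside."""
    tokens = re.findall(r"<[^>]+>|[^<\s]+", html)
    is_tag = [t.startswith("<") for t in tokens]
    n = len(tokens)
    prefix = [0]                              # prefix[i] = tag tokens among tokens[:i]
    for t in is_tag:
        prefix.append(prefix[-1] + t)
    best, best_span = -1, (0, 0)
    for i in range(n):
        for j in range(i, n + 1):
            tags_outside = prefix[i] + (prefix[n] - prefix[j])
            text_inside = (j - i) - (prefix[j] - prefix[i])
            if tags_outside + text_inside > best:
                best, best_span = tags_outside + text_inside, (i, j)
    i, j = best_span
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))
```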

Link extraction and analysis: links and anchor texts are extracted from the documents and stored in the document data store, together with the destination pages. They are used for calculating scores that are based on the link structure of the web. Anchor texts are concise topical representations of the destination document; anchor information may be indexed even for pages not yet crawled.

Caching: search engines need to be fast. Caching happens both on the client side (browsers) and on the server side (search engine). Popular queries account for 50% of all queries, so their answers can be cached; about half of the queries are still unique. Inverted lists of the index can also be cached.
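A minimal sketch of an answer cache, assuming a hypothetical evaluate_query stand-in for the engine's real query processor:

```python
from functools import lru_cache

def evaluate_query(query):
    """Stand-in for the engine's real query processor (hypothetical)."""
    return ("result for " + query,)

@lru_cache(maxsize=10_000)
def cached_results(query):
    """Repeated popular queries hit the cache; unique queries fall through."""
    return evaluate_query(query)
```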

Search and result presentation: the number of results is potentially very large, while the number of results shown to a user is very small. Basic architecture: given a query, the 10 results shown are a subset of the complete result set; if the user requests more results, the search engine can either recompute the query to generate the next 10 results, or obtain them from a partial result set maintained in main memory. In any case, a search engine never computes the full answer set for the whole Web.

Ranking for Web Search: ranking is based on topicality and quality. Topicality: language models. Quality/popularity/authority: PageRank, hubs and authorities. Hubs are pages with many outlinks; authorities are pages with many inlinks.

Challenge for ranking: identification of quality content on the Web. Evidence of quality can be indicated by signals such as domain names, text content, and links (as in PageRank). Additional useful signals are provided by the layout of the Web page, its title, metadata, font sizes, etc.

Other challenges: avoiding, preventing, and managing Web spam. Spammers are malicious users who try to trick search engines by artificially inflating signals used for ranking; spam is a consequence of the economic incentives of the current advertising model adopted by search engines. A further challenge is defining the ranking function and computing it.

Ranking signals. Signals of topicality: text content (simple word counts, or full ranking algorithms such as BM25), anchor texts, layout (titles, headings). Signals of quality: domain names, number of in-links and out-links, clicks. Other signals: page metadata, geographical location, language, query history. And always: avoiding spam, spam, spam.
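For reference, the usual BM25 scoring formula mentioned above (with f(t,d) the frequency of term t in document d, |d| the document length, avgdl the average document length, and k1, b tuning parameters):

```latex
\text{BM25}(d, q) = \sum_{t \in q} \operatorname{IDF}(t)\,
  \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\operatorname{avgdl}}\right)}
```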

Link-based ranking: anchor text; the number of in-links as an indication of popularity and quality; shared links as indications of relations between pages; hubs and authorities.

PageRank: the basic idea is that good pages point to good pages. Imagine a random walk through the Web: a random surfer wanders aimlessly between Web pages, randomly clicking one of the links on a page, or a "surprise me" button, and continues browsing like this for a very long time. Eventually, the random surfer has visited every single Web page, the popular pages much more often, because of the links leading to them. The outlinks from popular pages influence the path much more than those from less popular pages. The probability of viewing a page at any given moment is the PageRank of that page.
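A small power-iteration sketch of this random-surfer model, with the damping factor (the probability of following a link rather than the "surprise me" button) set to the conventional 0.85; the graph is just an adjacency dict, and dangling pages spread their rank uniformly.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: {page: [pages it links to]}. Returns {page: PageRank}."""
    pages = set(out_links) | {p for targets in out_links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}   # teleport / "surprise me"
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:                                            # dangling page
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# e.g. pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```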

Evaluation: monitoring ranking quality with standard precision-recall metrics. Precision of Web results should be measured only at the top positions in the ranking, say P@5, P@10, and P@20. Judgements are based on human assessment or on click-through data; click-through works well in large corpora. Signals include clicks, dwell time, and so on.
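P@k is simply the fraction of the top k results that are relevant; a one-line sketch:

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k ranked results judged relevant (P@k)."""
    return sum(1 for doc in ranked_results[:k] if doc in relevant) / k

# e.g. precision_at_k(["d3", "d7", "d1"], {"d3", "d1"}, k=3) == 2/3
```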

Spam. SPAM: repetitive, annoying behaviour? Where did the word come from? http://www.youtube.com/watch?v=anwy2mpt5RE