Search Quality. Jan Pedersen 10 September 2007

Similar documents
An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

SEARCH ENGINE INSIDE OUT

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Search Engines. Information Retrieval in Practice

Today we show how a search engine works

Information Retrieval II

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Directory Search Engines Searching the Yahoo Directory

The Anatomy of a Large-Scale Hypertextual Web Search Engine

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

Information Retrieval

How to Get Your Website Listed on Major Search Engines

Performance Analysis for Crawling

Information Networks. Hacettepe University Department of Information Management DOK 422: Information Networks

Mining Web Data. Lijun Zhang

Search Engine Optimization (SEO) & Your Online Success

Information Retrieval Spring Web retrieval

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

A Survey on Web Information Retrieval Technologies

THE HISTORY & EVOLUTION OF SEARCH

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Jan Pedersen 22 July 2010

THE WEB SEARCH ENGINE

CS47300: Web Information Search and Management

CS 347 Parallel and Distributed Data Processing

DEC Computer Technology LESSON 6: DATABASES AND WEB SEARCH ENGINES

CS 347 Parallel and Distributed Data Processing

Building an Internet-Scale Publish/Subscribe System

How Does a Search Engine Work? Part 1

Exam IST 441 Spring 2014

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Creating a Classifier for a Focused Web Crawler

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

5. search engine marketing

SEO. Definitions/Acronyms. Definitions/Acronyms

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

You got a website. Now what?

USER SEARCH INTERFACES. Design and Application

Brief (non-technical) history

Accessibility of INGO FAST 1997 ARTVILLE, LLC. 32 Spring 2000 intelligence

BIG DATA TESTING: A UNIFIED VIEW

A Dynamic Bayesian Network Click Model for Web Search Ranking

Eagle Eye. Sommersemester 2017 Big Data Science Praktikum. Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly

Search Engines. Information Technology and Social Life March 2, Ask difference between a search engine and a directory

Search Engine Techniques: A Review

Chapter 2. Architecture of a Search Engine

Information Retrieval

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

Towards a Distributed Web Search Engine

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

Optimized Searching Algorithm Based On Page Ranking: (OSA PR)

power up your business SEO (SEARCH ENGINE OPTIMISATION)

Information Retrieval May 15. Web retrieval

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Topology-Based Spam Avoidance in Large-Scale Web Crawls

Multimedia Streaming. Mike Zink

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Welcome to the class of Web Information Retrieval!

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

The Quickly Evolving World of Local SEO

Mining Web Data. Lijun Zhang

Information Retrieval. Lecture 9 - Web search basics

Distributed computing: index building and use

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Chapter IR:II. II. Architecture of a Search Engine. Indexing Process Search Process

Automatic Identification of User Goals in Web Search [WWW 05]

Part I: Data Mining Foundations

Top-To-Bottom (And Beyond) On-Page Optimization Guidebook

Web Search Strategy/Behavior. Meenu Sharma. Librarian Canadian Institute for International studies C-2, Phase-1, Industrial Area, Mohali

LIST OF ACRONYMS & ABBREVIATIONS

CS November 2017

CS November 2018

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Pay-Per-Click Advertising Special Report

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

5/13/2009. Introduction. Introduction. Introduction. Introduction. Introduction

Skill Area 209: Use Internet Technology. Software Application (SWA)

Searching in All the Right Places. How Is Information Organized? Chapter 5: Searching for Truth: Locating Information on the WWW

Complimentary SEO Analysis & Proposal. ageinplaceofne.com. Rashima Marjara

S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking

1. Conduct an extensive Keyword Research

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

A COMPARATIVE STUDY OF BYG SEARCH ENGINES

Web Search Algorithms - 1 -

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

doi: / _32

Extracting Information from Social Networks

Executive Summary. Performance Report for: The web should be fast. How does this affect me?

Information Retrieval and Web Search

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Executive Summary. Performance Report for: The web should be fast. Top 1 Priority Issues. How does this affect me?

Table of Contents. Introduction. Audience, Intent & Authority. Depth, Unique Value & User Experience. The Nuts & Bolts of SEO. Enhancing Your Content

Transcription:

Search Quality Jan Pedersen 10 September 2007

Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues 2

Search Landscape 2007 Source: Search Engine Watch: US web search share, July 2006 Three major Mainframes Google,Yahoo, and MSN >800M searches daily 60% international 10 6 machines $20B in Paid Search Revenues Large indices Billions of documents Petabytes of data Key Drivers: Scale, Quality, Distribution 3

World Internet Usage 4

Search Results page Text Input Search Assists Related Results Search Ads Ranked Results 5

Search Engine Architecture Stream Computation WebMap Query Serving Crawling WWW Meta data Load Replication Index Bandwidth Bottleneck Snapshot Indexing Index CPU Bottleneck Index Index 6

Query Serving Architecture Rectangular Array Each row is a replicate Each column is an index segment Results are merged across segments Each node evaluates the query against its segment. Latency is determined by the performance of a single node Sunnyvale L3 FE 1 QI 1 travel F5 BigIP FE 2 FE 32 travel travel travel QI 2 QI 8 3DNS travel WI 1,1 WI 1,2 WI 1,3 WI 1,48 WI 2,1 WI 2,2 WI 2,3 WI 2,48 WI 3,1 WI 3,2 WI 3,3 WI 3,48 WI 4,1 WI 4,2 WI 4,3 WI 4,48 travel NYC L3 7

8

What s the Goal? User Satisfaction Understand user intent Problems: Ambiguity and Context Generate relevant matches Problems: Scale and accuracy Present useful information Problems: Ranking and Presentation 9

Evaluation Graded Relevance score Editorial Assessment Session/Task fulfillment? Behavioral measures? 10

Clickrate Relevance Metric Average highest rank clicked perceptibly increased with the release of a new rank function. 11

Quality Dimensions Ranking Ability to rank hits by relevance Comprehensiveness Index size and composition Freshness Recency of indexed data Presentation Titles and Abstracts 12

Comprehensiveness Problem: Make accessible all useful Web pages Issues: Web has an infinite number of pages Finite resources available Bandwidth Disk capacity Selection Problem Which pages to visit Crawl Policy Which pages to index Index Selection Policy 13

Moore s Law and Index Size ~150M in 1998 ~5B in 2005 33x increase Moore would predict 25x What about 2010? 40B? Source: Search Engine Watch 1994 Yahoo (directory) and Lycos (index) go public 1995 Infoseek and Excite go public 1997 Alta Vista launches 100M index 1998 Inktomi and Google launched 1999 All The Web launched 2003 Yahoo purchases Inktomi and Overture 2004 Google goes public 2005 Msft launches MS Live 14

Freshness Problem: Ensure that what is indexed correctly reflects current state of the web Impossible to achieve exactly Revisit vs Discovery Divide and Conquer A few pages change continually Most pages are relatively static 15

Changing documents in daily crawl for 32-day period 10000 1000 100 html file out-links 10 1 0 5 10 15 20 25 30 35 changes 16

Freshness Source: Search Engine Showdown 17

Ranking Problem: Given a well-formed query, place the most relevant pages in the first few positions Issues: Scale: Many candidate matches Response in < 100 msecs Evaluation: Editorial User Behavior 18

Ranking Framework Regression problem Estimate editorial relevance given ranking features Query Dependent features Term overlap between query and Meta-data Content Query Independent Features Quality (e.g. Page Rank) Spamminess 19

Machine Learned Ranking Goal: Automatically construct a ranking function Input: Large number training examples Features that predict relevance Relevance metrics Output: Ranking function Enables rapid experimental cycle Scientific investigation of Modifications to existing features New feature 20

Ranking Features A0 - A4 anchor text score per term W0 - W4 term weights L0 - L4 first occurrence location (encodes hostname and title match) SP spam index: logistic regression of 85 spam filter variables (against relevance scores) F0 - F4 term occurrence frequency within document DCLNdocument length (tokens) ER Eigenrank HB Extra-host unique inlink count ERHBER*HB A0W0 etc. A0*W0 QA Site factor logistic regression of 5 site link and url count ratios SPN Proximity FF family friendly rating UD url depth 21

Implements (Tree 0) A0w0 < 22.3 Y N L0 < 18+1 R=0.0015 L1 < 18+1 L1 < 509+1 R=-0.0545 W0 < 856 F0 < 2 + 1 R = -0.2368 F2 < 1 + 1 R=-0.0039 R=-0.0199 R=-0.1185 R=-0.1604 R=-0.0790 22

Presentation Spelling Correction Also Try Short cuts Titles and Abstracts 23

Eye Tracking Studies Golden Triangle Top left corner Quick scan For candidate Longer scan For relevance 24

Comparison to State-of-the-art 25

Conclusions Search is a hard problem Solutions are approximate Measurement is difficult Search quality can be decomposed in separate but related problems Ranking Comprehensiveness Freshness Presentation 26