An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Transcription:

An Overview of Search Engine Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia haixu@microsoft.com July 24, 2007 1

Outline History of Search Engine Difference Between Software and Service Architecture of Search Engine 5 Tips On Optimizing Search Engine 3 Secrets On Implementing Search Engine 2

History of Search Engine Personal or Academic Site (1993-1996) Internet Portal (1996-1999) Technology Provider (1999-2002) Search Portal (2002-) 3

Personal or Academic Site (1993-1996) Archie WebCrawler Lycos Excite Yahoo 4

Internet Portal (1996-1999) Yahoo! Lycos Excite Infoseek 5

Technology Provider (1999-2002) AltaVista Inktomi Fast/AlltheWeb Google Goto/Overture 6

Search Portal (2002-) Google Yahoo MSN ASK 7

Lessons From The Past Technology is the biggest challenge Search is always an important application of the Internet A search engine can always be improved 8

Architecture of Search Engine URL DB Crawler Page DB Query Result Page Search 9

Difference Between Software and Service Product vs. Experience Feature vs. Refinement Develop vs. Operate Release vs. Serve Code vs. Parameter Update vs. Tune Bug Free vs. Optimal 10

Crawler Crawling is more difficult than you think Stability in downloading, computation and storage Scalability High Performance 11

Performance Content Analysis tf*idf html tag html visual information Link Analysis PageRank, Spam 12
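The slide names tf*idf without defining it. The following is a minimal sketch, assuming one common weighting (tf = raw term count in the document, idf = log(N / df)); the function name tf_idf and the sample documents are illustrative and not from the talk.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = raw count of the term in the document,
    idf = log(N / df), where df = number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["search engine crawler", "search engine index", "web crawler"]
w = tf_idf(docs)
```

A term that appears in fewer documents ("index") gets a higher weight than one that appears in many ("search"), which is the intuition behind using tf*idf for content analysis.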

Search Huge Traffic Huge Data Complicated Computation Very Large Server Cluster 13

Engineering Problem of Search Engine Intellectual Problem Optimization Non-intellectual Problem Implementation 14

5 Tips On Optimizing Search Engine Define problem from user perspective System-level thinking Feature is more important than classification method Tradeoff Combine several simple solutions into a powerful solution 15

Non-intellectual Problem Architecture High Performance 16

3 Secrets On Implementing Search Engine Cache Signature Hash Table 17

Cache What is cache What to do with cache is more important than how to cache Search result page cache 18

Cache (Cont.) Front-end Cache & Back-end Cache Caching Merged & Caching Raw Caching & Caching Display Information Caching everything 19

Cache (cont.) Cache them before search Cache In Disk 2 Terms Cache 20
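The search result page cache mentioned in the cache slides can be sketched as follows. This is an illustration only: the talk does not say which eviction policy is used, so least-recently-used (LRU) eviction, the class name ResultPageCache, and the fake_backend stand-in are all assumptions.

```python
from collections import OrderedDict

class ResultPageCache:
    """Search result page cache with LRU eviction (policy is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()  # normalized query -> result page
        self.hits = 0
        self.misses = 0

    def get(self, query, search_fn):
        # Normalize case and whitespace so equivalent queries share one entry.
        key = " ".join(query.lower().split())
        if key in self._pages:
            self._pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._pages[key]
        self.misses += 1
        page = search_fn(key)  # the expensive back-end search
        self._pages[key] = page
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # evict least recently used
        return page

def fake_backend(query):
    """Hypothetical stand-in for the real search back end."""
    return "results for " + query

cache = ResultPageCache(capacity=2)
page = cache.get("Search  Engine", fake_backend)        # miss: runs the search
page_again = cache.get("search engine", fake_backend)   # hit: served from cache
```

Normalizing the query before lookup is one example of "what to do with cache" mattering more than the caching mechanism itself: it raises the hit rate without changing the data structure.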

Signature What is signature What problems can benefit from signature trick Signature algorithm Probability of conflict 21
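The slide does not give the signature algorithm, so the sketch below assumes one simple choice: take the first 64 bits of an MD5 digest as a fixed-size signature for a string such as a URL or a query. The probability of conflict is estimated with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(bits+1)). The function names are illustrative.

```python
import hashlib
import math

def signature(text, bits=64):
    """Fixed-size integer signature: the first `bits` bits of MD5(text).

    MD5 truncated to 64 bits is an assumption, not the talk's algorithm.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")

def conflict_probability(n_items, bits=64):
    """Birthday-bound estimate that at least two of n_items signatures collide."""
    return -math.expm1(-n_items * (n_items - 1) / 2 ** (bits + 1))

sig = signature("http://www.example.com/")
p_billion = conflict_probability(10**9)  # chance of any conflict among 1e9 items
```

Comparing and storing a fixed-size integer is cheaper than comparing and storing the full string, at the cost of a small, quantifiable chance that two different strings share a signature.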

Hash Table What is hash table Implementation of hash table using signature 22
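One way to read "implementation of hash table using signature" is a table that stores a fixed-size signature in each slot instead of the variable-length key. The sketch below makes that assumption; the open-addressing layout, the 64-bit MD5-based signature, and the class name SignatureTable are illustrative choices, not the talk's design.

```python
import hashlib

def signature(key):
    """64-bit signature of a string key (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

class SignatureTable:
    """Fixed-capacity open-addressing table that stores signatures, not keys."""

    def __init__(self, n_slots=1024):
        self.sigs = [None] * n_slots    # None marks an empty slot
        self.values = [None] * n_slots

    def _slot(self, sig):
        # Linear probing: start at sig mod table size and walk forward
        # until the signature or an empty slot is found.
        i = sig % len(self.sigs)
        for _ in range(len(self.sigs)):
            if self.sigs[i] is None or self.sigs[i] == sig:
                return i
            i = (i + 1) % len(self.sigs)
        return -1  # table is full and the signature is absent

    def put(self, key, value):
        sig = signature(key)
        i = self._slot(sig)
        if i < 0:
            raise RuntimeError("table is full")
        self.sigs[i] = sig
        self.values[i] = value

    def get(self, key, default=None):
        sig = signature(key)
        i = self._slot(sig)
        if i >= 0 and self.sigs[i] == sig:
            return self.values[i]
        return default

table = SignatureTable(n_slots=8)
table.put("search engine", 3)
table.put("crawler", 1)
```

Because keys are never stored, two different keys with the same signature would be treated as one entry; the design trades that small conflict probability for fixed-size slots.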

Document is an by doc is an by term just is a hash table of terms 23

Query Statistics Problem: Given a query log file, compute the frequency of each query Solution 1: Sort the query log file, then count each query Solution 2: Hash table 24
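Both solutions to the query statistics problem can be sketched in a few lines. The function names and the sample log are illustrative; a real log would be read from a file rather than a list.

```python
from itertools import groupby

def count_by_sorting(queries):
    """Solution 1: sort, then count each run of equal queries. O(n log n)."""
    return {q: sum(1 for _ in run) for q, run in groupby(sorted(queries))}

def count_by_hashing(queries):
    """Solution 2: one pass with a hash table (a Python dict). O(n) expected."""
    counts = {}
    for q in queries:
        counts[q] = counts.get(q, 0) + 1
    return counts

log = ["msn", "google", "msn", "yahoo", "msn"]
by_sort = count_by_sorting(log)
by_hash = count_by_hashing(log)
```

The hash table version avoids the sort and needs only one pass over the log, which matters when the log is large.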

O(n) Sort Problem: Sort student records according to examination score Solution 1: qsort(), O(n log n) Solution 2: Hash table, O(n) 25
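The O(n) solution works because examination scores fall in a small, known range, so each record can be dropped directly into the bucket for its score. A minimal sketch, assuming integer scores from 0 to 100 (the range, the function name, and the sample records are assumptions):

```python
def sort_by_score(records, max_score=100):
    """Sort (name, score) records by ascending score in O(n + max_score).

    Assumes scores are integers in the range 0..max_score.
    """
    buckets = [[] for _ in range(max_score + 1)]
    for name, score in records:
        buckets[score].append((name, score))  # the score itself is the index
    # Reading the buckets in order yields the sorted records; records with
    # equal scores keep their original relative order.
    return [rec for bucket in buckets for rec in bucket]

records = [("Ann", 90), ("Bob", 75), ("Cid", 90), ("Dee", 60)]
ranked = sort_by_score(records)
```

This is a bucket (counting) sort rather than a comparison sort, so the O(n log n) lower bound for comparison sorting does not apply.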

De-duplicate URLs Problem: De-duplicate URLs with same content Solution 1: Sort, then compare Solution 2: Hash table 26
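The hash table solution for de-duplication can be sketched by keying the table on a signature of the page content, so that URLs serving identical content collapse into one entry. Using an MD5 digest as the signature and keeping the first URL seen are assumptions; the function name and sample pages are illustrative.

```python
import hashlib

def dedupe_by_content(pages):
    """Keep the first URL seen for each distinct page content.

    pages: iterable of (url, content) pairs.
    """
    first_url = {}  # content signature -> first URL with that content
    for url, content in pages:
        sig = hashlib.md5(content.encode("utf-8")).digest()
        first_url.setdefault(sig, url)
    return list(first_url.values())

pages = [
    ("http://a.example/", "hello"),
    ("http://a.example/index.html", "hello"),  # same content, different URL
    ("http://b.example/", "world"),
]
unique_urls = dedupe_by_content(pages)
```

This catches only byte-identical content; near-duplicate pages would need a different kind of signature.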

Set Operations A*B A+B A-B B-A 27
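The slide lists the four operations without defining the notation. The sketch below assumes A*B means intersection, A+B means union, and A-B and B-A mean set difference, and uses Python's built-in set type, which is itself a hash table. The document ID values are made up for illustration.

```python
def set_operations(a, b):
    """Compute the four operations on two collections of document IDs."""
    in_a, in_b = set(a), set(b)  # build one hash table per input
    return {
        "A*B": in_a & in_b,  # intersection: IDs in both
        "A+B": in_a | in_b,  # union: IDs in either
        "A-B": in_a - in_b,  # IDs in A but not in B
        "B-A": in_b - in_a,  # IDs in B but not in A
    }

A = [1, 2, 3, 5]
B = [2, 5, 8]
ops = set_operations(A, B)
```

Each operation takes expected linear time in the input sizes, because membership tests against a hash table are constant time on average.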

Page Storage Architecture [Diagram; box labels: Crawl (several), Page Distributor, Page DB (several)] 28

Architecture Page DB 29

Search Architecture Load Balancer Query Result Page Frontend Frontend Frontend 30

Crawling Architecture How about this solution: [Diagram; box labels: Url DB (several), URL, Crawl (several), Page DB] WRONG! 31

Crawling Architecture (Cont.) [Diagram; box labels: Url DB (several), URL, Crawl (several), Page DB, URL Distributor] 32
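The slides do not say how the URL Distributor decides which crawler receives a URL. One common approach, assumed here purely for illustration, is to hash the URL's host name so that every URL from the same site goes to the same crawler. The function names and sample URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, n_crawlers):
    """Pick a crawler index by hashing the URL's host name (an assumed policy)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers

def distribute(urls, n_crawlers):
    """Split a list of URLs into one queue per crawler."""
    queues = [[] for _ in range(n_crawlers)]
    for url in urls:
        queues[assign_crawler(url, n_crawlers)].append(url)
    return queues

urls = [
    "http://example.com/a",
    "http://EXAMPLE.com/b?x=1",
    "http://example.org/",
]
queues = distribute(urls, n_crawlers=4)
```

Partitioning by host keeps each site's URLs in a single crawler's URL DB, which makes it easier to de-duplicate URLs and to limit the request rate per site.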

Summary History of Search Engine Difference Between Software and Service Architecture of Search Engine 5 Tips On Optimizing Search Engine 3 Secrets On Implementing Search Engine 33

34 Thank you!

35 Q&A