Search Engine Architecture. Hongning Wang

Search Engine Architecture Hongning Wang CS@UVa

Classical search engine architecture
Source: Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems 30.1 (1998): 107-117. Citation count: 16636 (as of August 29, 2017).
[Figure: architecture diagram from the paper, annotated with four components: crawler and indexer, document analyzer, query parser, ranking model.]
CS@UVa CS4501: Information Retrieval

[Figure: search engine architecture. Components shown: user input, query parser, ranking model, result postprocessing, result display, crawler & indexer, document analyzer & auxiliary database, domain-specific database.]

Abstraction of search engine architecture
[Figure: abstract architecture. Labels shown: crawler, indexed corpus, doc analyzer, doc representation, indexer, index, user, query, query representation, ranker, results, feedback, evaluation, ranking procedure, research attention.]

Core IR concepts
Information need: "an individual or group's desire to locate and obtain information to satisfy a conscious or unconscious need" (Wikipedia). An IR system exists to satisfy users' information needs.
Query: a designed representation of a user's information need, in natural language or some managed form.

Core IR concepts
Document: a representation of information that potentially satisfies a user's information need. Text, image, video, audio, etc.
Relevance: relatedness between documents and a user's information need. Multiple perspectives: topical, semantic, temporal, spatial, etc.
One sentence about IR: rank documents by their relevance to the user's information need.

Key components in a search engine
Web crawler: an automatic program that systematically browses the web for the purpose of web content indexing and updating.
Document analyzer & indexer: manage the crawled web content and provide efficient access to web documents.
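The crawler and indexer roles above can be sketched in a few lines. This is a minimal illustration, not the architecture from the slides: the in-memory `TOY_WEB` graph, the breadth-first crawl order, and the choice of an inverted index as the "efficient access" structure are all assumptions made for the example.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("search engine architecture", ["b.html", "c.html"]),
    "b.html": ("web crawler and indexer", ["c.html"]),
    "c.html": ("ranking model for search", ["a.html"]),
    "d.html": ("unreachable page", []),
}

def crawl(seed, web):
    """Breadth-first traversal from a seed URL; returns {url: text} for reachable pages."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            # Skip links that leave the toy web or were already scheduled.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_inverted_index(pages):
    """Map each term to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return {term: sorted(urls) for term, urls in index.items()}

pages = crawl("a.html", TOY_WEB)
index = build_inverted_index(pages)
```

Looking up `index["search"]` returns the pages containing that term without scanning every document, which is the kind of efficient access the indexer is meant to provide.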

Key components in a search engine
Query parser: compile user-input keyword queries into a managed system representation.
Ranking model: sort candidate documents according to their relevance to the given query.
Result display: present the retrieved results to users to satisfy their information need.
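A rough sketch of how the query parser and ranking model fit together follows. The corpus `DOCS`, the helper names, and the scoring rule (summed query-term frequency) are assumptions chosen for brevity; the slides do not prescribe a particular ranking function.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> text.
DOCS = {
    "d1": "search engine architecture and search quality",
    "d2": "web crawler architecture",
    "d3": "cooking recipes",
}

def parse_query(raw):
    """Turn a raw keyword query into a managed form: lowercase, de-duplicated terms."""
    terms = []
    for term in raw.lower().split():
        if term not in terms:
            terms.append(term)
    return terms

def score(terms, text):
    """Sum of query-term frequencies in the document (a stand-in relevance score)."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in terms)

def rank(raw_query, docs):
    """Return (doc_id, score) pairs with non-zero score, best first."""
    terms = parse_query(raw_query)
    scored = [(doc_id, score(terms, text)) for doc_id, text in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    # Sort by descending score, breaking ties by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

results = rank("Search Architecture", DOCS)
```

Here `results` holds the ordered list that a result display component would present to the user.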

Key components in a search engine
Retrieval evaluation: assess the quality of the returned results.
Relevance feedback: propagate the quality judgment back to the system for search result refinement.
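One common way to assess the quality of returned results is precision and recall over the top k results. The slides do not name specific metrics here, so treat the choice of metric, the function name, and the sample judgments below as assumptions.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall computed over the top-k entries of a ranked list."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical system output and human relevance judgments.
ranked = ["d1", "d4", "d2", "d5"]
relevant = {"d1", "d2", "d3"}

p, r = precision_recall_at_k(ranked, relevant, k=3)
```

With these inputs, two of the top three results are relevant, and two of the three relevant documents are found.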

Key components in a search engine
Search query logs: record users' interaction history with the search engine.
User modeling: understand users' longitudinal information needs and assess users' satisfaction with search engine output.
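As a small illustration of what can be derived from query logs, the sketch below aggregates a per-query click-through rate. The log schema and the use of clicks as a satisfaction signal are assumptions; clicks are at best a noisy proxy for whether the user was satisfied.

```python
from collections import defaultdict

# Hypothetical log records: (user, query, clicked result or None).
LOG = [
    ("u1", "search engine", "d1"),
    ("u1", "search engine", None),
    ("u2", "search engine", "d1"),
    ("u2", "web crawler", None),
]

def click_through_rate(log):
    """Fraction of issued queries that led to a click, per distinct query string."""
    issued = defaultdict(int)
    clicked = defaultdict(int)
    for _user, query, click in log:
        issued[query] += 1
        if click is not None:
            clicked[query] += 1
    return {query: clicked[query] / issued[query] for query in issued}

ctr = click_through_rate(LOG)
```

A query with a persistently low rate, such as "web crawler" in this toy log, is a candidate for closer inspection of the results it returns.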

Discussion: browsing vs. querying
Browsing (what Yahoo did before): the system organizes information with structures, and a user navigates to relevant information by following a path enabled by the structures. Works well when the user wants to explore information, doesn't know what keywords to use, or can't conveniently enter a query (e.g., on a smartphone).
Querying (what Google does): a user enters a (keyword) query, and the system returns a set of relevant documents. Works well when the user knows exactly what query to use to express her information need.

Pull vs. push in information retrieval
Pull mode (with query): users take the initiative and pull relevant information out of a retrieval system. Works well when a user has an ad hoc information need.
Push mode (without query): systems take the initiative and push relevant information to users. Works well when a user has a stable information need or the system has good knowledge about a user's need.
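The contrast between the two modes can be made concrete with a toy example. The item list, the all-terms matching rule for pull, and the any-term overlap rule for push are simplifying assumptions, not a description of any real system.

```python
# Hypothetical stream of items arriving at the system.
ITEMS = [
    "new search engine released",
    "football results",
    "search ranking tutorial",
]

def pull(query, items):
    """Pull mode: the user issues a query; return the items matching every query term."""
    terms = set(query.lower().split())
    return [item for item in items if terms <= set(item.lower().split())]

def push(profile_terms, items):
    """Push mode: no query; forward items that overlap a stored, stable interest profile."""
    return [item for item in items if profile_terms & set(item.lower().split())]

pulled = pull("search engine", ITEMS)
pushed = push({"search", "ranking"}, ITEMS)
```

In pull mode the matching is driven by a one-off query, while in push mode the same stored profile is applied to whatever items arrive.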

Discussion: is Yelp a search engine?

What you should know
Basic workflow and components in an IR system
Core concepts in IR
Browsing vs. querying
Pull vs. push of information

Today's reading
Introduction to Information Retrieval, Chapter 19: Web search basics