Search Engine Architecture


Search Engine Architecture
CISC489/689-010, Lecture #2
Wednesday, Feb. 11
Ben Carterette

Search Engine Architecture
- A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
- Describes a system at a particular level of abstraction
- Architecture of a search engine determined by two requirements: effectiveness (quality of results) and efficiency (response time and throughput)

Indexing Process
[Diagram: documents (e-mails, web pages, Word docs, news articles, ...) flow through text acquisition (crawler, feeds, filter, ...) into an accessible data store, then through text transformation (parsing, stopping, stemming, extraction, ...) to index creation (document/term stats, weighting, inversion, ...) on the index server(s)]
- Text acquisition identifies and stores documents for indexing
- Text transformation transforms documents into index terms or features
- Index creation takes index terms and creates data structures (indexes) to support fast searching

Query Process
[Diagram: queries pass through user interaction to ranking f(q,d) on the server(s), drawing on the accessible data store; evaluation (precision, recall, clicks, ...) monitors the whole process]
- User interaction supports creation and refinement of query, display of results
- Ranking uses query and indexes to generate ranked list of documents
- Evaluation monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition
Crawler
- Identifies and acquires documents for the search engine
- Many types: web, enterprise, desktop
- Web crawlers follow links to find documents
  - Must efficiently find huge numbers of web pages (coverage) and keep them up to date (freshness)
  - Single-site crawlers for site search
  - Topical or focused crawlers for vertical search
- Document crawlers for enterprise and desktop search
  - Follow links and scan directories

Text Acquisition
Feeds
- Real-time streams of documents, e.g., web feeds for news, blogs, video, radio, TV
- RSS is a common standard
- RSS reader can provide new XML documents to the search engine
Conversion
- Convert a variety of documents into a consistent text-plus-metadata format, e.g., HTML, XML, Word, PDF, etc. to XML
- Convert text encoding for different languages, using a Unicode standard like UTF-8
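As a concrete illustration of the pull-feed idea, here is a minimal sketch of an RSS reader using only Python's standard library. The feed XML and URLs are invented for the example; a real reader would fetch this document periodically from the feed's URL.

```python
import xml.etree.ElementTree as ET

# An invented RSS 2.0 feed, standing in for what a reader would
# periodically fetch from a news or blog source.
FEED_XML = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Story one</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Story two</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

def new_documents(feed_xml, seen_links):
    """Return (title, link) pairs for items not seen in earlier polls."""
    root = ET.fromstring(feed_xml)
    docs = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            docs.append((item.findtext("title"), link))
            seen_links.add(link)
    return docs

seen = {"http://example.com/1"}        # already indexed in a previous poll
print(new_documents(FEED_XML, seen))   # only the unseen item is returned
```

Only items past the set of already-seen links are handed to the indexer, which is exactly the "examine the end of the feed" behavior described later for document feeds.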

Text Acquisition
Document data store
- Stores text, metadata, and other related content for documents
  - Metadata is information about a document, such as type and creation date
  - Other content includes links and anchor text
- Provides fast access to document contents for search engine components, e.g., result list generation
- Could use a relational database system
  - More typically, a simpler, more efficient storage system is used due to the huge number of documents

Text Transformation
Parser
- Processes the sequence of text tokens in the document to recognize structural elements, e.g., titles, links, headings, etc.
- Tokenizer recognizes "words" in the text
  - Must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
- Markup languages such as HTML and XML often used to specify structure
  - Tags used to specify document elements, e.g., <h2> Overview </h2>
  - Document parser uses syntax of markup language (or other formatting) to identify structure
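The tokenization issues above (capitalization, hyphens, apostrophes, separators) can be made concrete with a small sketch. The policy here is one plausible choice, not the one any particular engine uses: lowercase, treat hyphens as separators, and keep internal apostrophes.

```python
import re

def tokenize(text):
    """One plausible tokenization policy (an assumption, not prescribed
    by the slides): lowercase everything, split on hyphens, keep internal
    apostrophes, and drop other non-alphanumeric separators."""
    text = text.lower()
    text = text.replace("-", " ")        # treat hyphens as separators
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

print(tokenize("Bigcorp's 2007 bid for Web-scale search"))
```

Each decision (split hyphens or not, keep apostrophes or not) changes what ends up in the index, which is why the tokenizer must be consistent between indexing and query processing.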

Text Transformation
Stopping
- Remove common words, e.g., "and", "or", "the", "in"
- Some impact on efficiency and effectiveness
- Can be a problem for some queries
Stemming
- Group words derived from a common stem, e.g., "computer", "computers", "computing", "compute"
- Usually effective, but not for all queries
- Benefits vary for different languages

Text Transformation
Link Analysis
- Makes use of links and anchor text in web pages
- Link analysis identifies popularity and community information, e.g., PageRank
- Anchor text can significantly enhance the representation of pages pointed to by links
- Significant impact on web search, less important in other applications
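Stopping and stemming can be sketched together. The stopword list is a tiny illustrative sample, and the stemmer is a deliberately crude suffix-stripper, not the Porter or Krovetz algorithms a real system would use; it only shows how related forms collapse to a shared stem.

```python
STOPWORDS = {"and", "or", "the", "in", "of", "to", "a"}  # tiny sample list

def stem(word):
    """Crude suffix-stripping stemmer (a sketch, not Porter's algorithm):
    remove a few common endings when enough of the word remains."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def transform(tokens):
    """Apply stopping, then stemming, as in the text transformation stage."""
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(transform(["the", "computers", "and", "computing"]))
```

Note how "computers" and "computing" both become "comput", so a query using one form matches documents using the other, which is the intended effect of stemming.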

Text Transformation
Information Extraction
- Identify classes of index terms that are important for some applications
  - e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
Classifier
- Identifies class-related metadata for documents
  - i.e., assigns labels to documents, e.g., topics, reading levels, sentiment, genre
- Use depends on application

Index Creation
Document statistics
- Gathers counts and positions of words and other features
- Used in ranking algorithm
Weighting
- Computes weights for index terms
- Used in ranking algorithm
- e.g., tf.idf weight: a combination of term frequency in the document and inverse document frequency in the collection
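The tf.idf idea can be shown in a few lines. This uses one common variant (raw term frequency times log of N over document frequency); many other weighting formulas exist, and the slides do not commit to a specific one.

```python
import math

def tf_idf(term, doc_tokens, collection):
    """One common tf.idf variant (an assumption; many exist):
    raw term frequency in the document times log(N / df)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in collection if term in d)   # document frequency
    n = len(collection)
    return tf * math.log(n / df) if df else 0.0

docs = [["search", "engine", "search"], ["web", "crawler"], ["search", "web"]]
print(round(tf_idf("search", docs[0], docs), 3))   # frequent here, common overall
```

A term that is frequent in one document but rare in the collection gets a high weight; a term that appears everywhere gets a weight near zero, which is the intuition behind combining tf with idf.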

Index Creation
Inversion
- Core of indexing process
- Converts document-term information to term-document information for indexing
  - Difficult for very large numbers of documents
- Format of inverted file is designed for fast query processing
  - Must also handle updates
  - Compression used for efficiency

Index Creation
Index distribution
- Distributes indexes across multiple computers and/or multiple sites
- Essential for fast query processing with large numbers of documents
- Many variations: document distribution, term distribution, replication
- P2P and distributed IR involve search across multiple sites
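The inversion step itself, converting document-to-term information into term-to-document postings, fits in a short in-memory sketch. Real inverted files are built out-of-core with compression; this only shows the data-structure transformation.

```python
from collections import defaultdict

def invert(docs):
    """Convert doc_id -> token list into term -> postings list.
    Each posting is (doc_id, positions), supporting phrase queries."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)

docs = {1: ["web", "search"], 2: ["search", "engine", "search"]}
index = invert(docs)
print(index["search"])   # postings: (doc, positions) pairs
```

At query time the engine reads the postings list for each query term instead of scanning every document, which is what makes fast query processing possible.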

User Interaction
Query input
- Provides interface and parser for a query language
- Most web queries are very simple; other applications may use forms
- Query language used to describe more complex queries and results of query transformation
  - e.g., Boolean queries, Indri and Galago query languages
  - similar to SQL language used in database applications
  - IR query languages also allow content and structure specifications, but focus on content

User Interaction
Query transformation
- Improves initial query, both before and after initial search
- Includes text transformation techniques used for documents
- Spell checking and query suggestion provide alternatives to the original query
- Query expansion and relevance feedback modify the original query with additional terms
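A toy version of spell-checking-style query suggestion can be built from string similarity alone. Real systems mine suggestions from query logs and use noisy-channel models; here the vocabulary is a small invented list and difflib's similarity score stands in for a proper error model.

```python
import difflib

# A toy query-term vocabulary; real engines mine this from query logs.
VOCABULARY = ["search", "engine", "architecture", "crawler", "retrieval"]

def suggest(word, vocabulary=VOCABULARY):
    """Suggest the closest vocabulary term by string similarity,
    if any term is close enough; otherwise keep the word as typed."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(suggest("serch"))   # a likely correction from the vocabulary
```

The cutoff keeps the system from "correcting" words that are not close to anything known, one of the judgment calls any query transformation has to make.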

User Interaction
Results output
- Constructs the display of ranked documents for a query
- Generates snippets to show how queries match documents
- Highlights important words and passages
- Retrieves appropriate advertising in many applications
- May provide clustering and other visualization tools

Ranking
Scoring
- Calculates scores for documents using a ranking algorithm
- Core component of search engine
- Basic form of score: sum over terms t of q_t * d_t, where q_t and d_t are the query and document term weights for term t
- Many variations of ranking algorithms and retrieval models
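The basic scoring formula from the slide, the sum over terms of q_t times d_t, is just a dot product over the vocabulary, and is short enough to write out directly:

```python
def score(query_weights, doc_weights):
    """The basic ranking score: sum over terms t of q_t * d_t,
    i.e., a dot product of query and document term-weight vectors.
    Terms absent from the document contribute zero."""
    return sum(q * doc_weights.get(term, 0.0)
               for term, q in query_weights.items())

q = {"search": 1.0, "engine": 1.0}
d = {"search": 0.5, "engine": 0.3, "web": 0.2}
print(score(q, d))
```

Different retrieval models amount to different choices of how the q_t and d_t weights are computed (binary, tf.idf, language-model based), while this combination step stays essentially the same.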

Ranking
Performance optimization
- Designing ranking algorithms for efficient processing
- Term-at-a-time vs. document-at-a-time processing
- Safe vs. unsafe optimizations
Distribution
- Processing queries in a distributed environment
- Query broker distributes queries and assembles results
- Caching is a form of distributed searching

Evaluation
Logging
- Logging user queries and interaction is crucial for improving search effectiveness and efficiency
- Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
- Measuring and tuning ranking effectiveness
Performance analysis
- Measuring and tuning system efficiency

How Does It Really Work?
- This course explains these components of a search engine in more detail
- Often many possible approaches and techniques for a given component
- Focus is on the most important alternatives, i.e., explain a small number of approaches in detail rather than many approaches
  - Importance based on research results and use in actual search engines
- Alternatives described in references

Text Acquisition: Web Crawling, Feeds, and Storage

Web Crawler
- Finds and downloads web pages automatically; provides the collection for searching
- Web is huge and constantly growing
- Web is not under the control of search engine providers
- Web pages are constantly changing
- Crawlers also used for other types of data

Retrieving Web Pages
- Every page has a unique uniform resource locator (URL)
- Web pages are stored on web servers that use HTTP to exchange information with client software, e.g., ...

Retrieving Web Pages
- Web crawler client program connects to a domain name system (DNS) server
- DNS server translates the hostname into an internet protocol (IP) address
- Crawler then attempts to connect to the server host using a specific port
- After connection, crawler sends an HTTP request to the web server to request a page
  - usually a GET request

Crawling the Web
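The first steps of a fetch, splitting a URL into the hostname to resolve, the port to connect to, and the GET request line to send, can be sketched with the standard library's URL parser. The URL is an invented example, and no network connection is made here.

```python
from urllib.parse import urlsplit

def request_for(url):
    """Decompose a URL into (hostname, port, request line), the pieces
    a crawler needs before DNS lookup and the HTTP GET request."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    request_line = f"GET {parts.path or '/'} HTTP/1.1"
    return parts.hostname, port, request_line

print(request_for("http://www.example.com/news/index.html"))
```

The hostname goes to the DNS server for IP resolution, the port (80 for HTTP, 443 for HTTPS unless the URL overrides it) is used for the TCP connection, and the request line is what the crawler sends once connected.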

Web Crawler
- Starts with a set of seeds, which are a set of URLs given to it as parameters
- Seeds are added to a URL request queue
- Crawler starts fetching pages from the request queue
- Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
- New URLs added to the crawler's request queue, or frontier
- Continue until no more new URLs or disk full

Web Crawling
[Plot: utility vs. number of pages, illustrating the long tail of web pages; unordered crawling produces these low-utility pages]
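The seed-and-frontier loop above can be sketched directly. To keep it self-contained, a small dictionary of page-to-links stands in for actually fetching and parsing pages over HTTP; the page names are invented.

```python
from collections import deque

# A simulated web: page -> links found when that page is parsed.
# Stands in for fetching and parsing real pages over HTTP.
PAGES = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def crawl(seeds):
    """Basic crawler loop: seed the frontier, fetch a URL, parse out
    its links, enqueue unseen ones, stop when the frontier is empty."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                  # "download" the page
        for link in PAGES.get(url, []):    # links found by the parser
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["a.html"]))   # breadth-first order from the seed
```

The `seen` set is what keeps the crawler from looping forever on cycles like d.html linking back to a.html; a real frontier also prioritizes URLs by importance rather than fetching first-in, first-out.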

Web Crawling
Ordering URLs
- Crawl URLs in some order of importance
- Random surfer model: a user starts on a page and randomly clicks links, occasionally switching to a different page with no click
  - What is the probability the user will land on any given page? Higher probability means greater importance
  - PageRank

Web Crawling
- Web crawlers spend a lot of time waiting for responses to requests
- To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
- Crawlers could potentially flood sites with requests for pages
- To avoid this problem, web crawlers use politeness policies, e.g., a delay between requests to the same web server
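A per-host politeness delay can be sketched as a small bookkeeping class. The design here (an explicit clock argument and a fixed delay) is an illustrative choice, not a prescribed interface; real crawlers integrate this with their frontier scheduling.

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Sketch of a politeness policy: a URL may be fetched only if at
    least `delay` seconds have passed since the last request to the
    same web server. The current time is passed in for testability."""
    def __init__(self, delay=10.0):
        self.delay = delay
        self.last_request = {}   # host -> time of last permitted request

    def may_fetch(self, url, now):
        host = urlsplit(url).hostname
        last = self.last_request.get(host)
        if last is not None and now - last < self.delay:
            return False
        self.last_request[host] = now
        return True

policy = PolitenessPolicy(delay=10.0)
print(policy.may_fetch("http://example.com/a", now=0.0))   # first hit: allowed
print(policy.may_fetch("http://example.com/b", now=3.0))   # too soon: refused
print(policy.may_fetch("http://example.com/b", now=12.0))  # delay elapsed
```

With hundreds of threads fetching at once, this per-host gate is what keeps the aggregate crawler from flooding any single site even though overall throughput stays high.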

Controlling Crawling
- Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
- Robots.txt file can be used to control crawlers

Simple Crawler Thread
[Code figure from the slides not reproduced in this transcription]
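Checking robots.txt before fetching is directly supported by Python's standard library. The robots.txt content and crawler name below are invented examples; a real crawler would fetch the file from the site root first.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, standing in for one fetched from a site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The crawler consults the parsed rules before each fetch:
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))
print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))
```

Robots.txt is a convention rather than an enforcement mechanism: a polite crawler checks it and honors the Disallow rules, but nothing on the server side prevents an impolite one from ignoring it.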

Freshness
- Web pages are constantly being added, deleted, and modified
- Web crawler must continually revisit pages it has already crawled to see if they have changed, in order to maintain the freshness of the document collection
  - Stale copies no longer reflect the real contents of the web pages

Freshness
- HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
  - returns information about the page, not the page itself

Freshness
- Not possible to constantly check all pages; must check important pages and pages that change frequently
- Freshness is the proportion of pages that are fresh
- Optimizing for this metric can lead to bad decisions, such as not crawling popular sites
- Age is a better metric

Age
- Expected age of a page t days after it was last crawled:
  age(lambda, t) = integral from 0 to t of P(page changed at time x) * (t - x) dx
- Web page updates follow the Poisson distribution on average
  - time until the next update is governed by an exponential distribution, so P(page changed at time x) = lambda * e^(-lambda * x)

Age
- The older a page gets, the more it costs not to crawl it
[Plot: expected age as a function of time since last crawl, with mean change frequency lambda = 1/7 (one change per week)]

Focused Crawling
- Attempts to download only those pages that are about a particular topic
  - used by vertical search applications
- Relies on the fact that pages about a topic tend to have links to other pages on the same topic
  - popular pages for a topic are typically used as seeds
- Crawler uses a text classifier to decide whether a page is on topic
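The expected-age integral can be evaluated numerically for the slide's example rate of one change per week. The integrand below assumes Poisson updates, so the change-time density is lambda times e to the minus lambda x; the closed form t - (1 - e^(-lambda t)) / lambda follows by integration by parts and is included as a cross-check.

```python
import math

def expected_age(lam, t, steps=100_000):
    """Midpoint-rule evaluation of the expected-age integral:
    age(lam, t) = integral_0^t lam * e^(-lam*x) * (t - x) dx,
    assuming page changes arrive as a Poisson process with rate lam."""
    dx = t / steps
    return sum(lam * math.exp(-lam * (i + 0.5) * dx) * (t - (i + 0.5) * dx) * dx
               for i in range(steps))

def closed_form(lam, t):
    """The same integral done analytically, for comparison."""
    return t - (1 - math.exp(-lam * t)) / lam

lam = 1 / 7   # one change per week, on average
print(round(expected_age(lam, 14), 3))
print(round(closed_form(lam, 14), 3))
```

Because the integrand grows with t, expected age accelerates the longer a page goes uncrawled, which is the "older a page gets, the more it costs not to crawl it" point above.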

Deep Web
- Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
  - much larger than the conventional Web
- Three broad categories:
  - private sites: no incoming links, or may require log-in with a valid account
  - form results: sites that can be reached only after entering some data into a form
  - scripted pages: pages that use JavaScript, Flash, or another client-side language to generate links

Document Feeds
- Many documents are published: created at a fixed time and rarely updated again
  - e.g., news articles, blog posts, press releases, email
- Published documents from a single source can be ordered in a sequence called a document feed
  - new documents found by examining the end of the feed

Document Feeds
- Two types:
  - A push feed alerts the subscriber to new documents
  - A pull feed requires the subscriber to check periodically for new documents
- Most common format for pull feeds is called RSS
  - Really Simple Syndication, RDF Site Summary, Rich Site Summary, or...

Conversion
- Text is stored in hundreds of incompatible file formats
  - e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
- Other types of files also important
  - e.g., PowerPoint, Excel
- Typically use a conversion tool
  - converts the document content into a tagged text format such as HTML or XML
  - retains some of the important formatting information
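The simplest possible conversion step, extracting the visible text from HTML, can be sketched with the standard library's HTML parser. Real conversion tools also retain structure such as titles and headings rather than discarding all markup as this sketch does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal conversion sketch: collect the visible text of an HTML
    document, dropping all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><h2>Overview</h2><p>Search engine architecture.</p></html>")
print(extractor.text())
```

A production converter would instead emit tagged text, keeping the h2 marked as a heading, because downstream components like the parser and snippet generator make use of that structure.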

Character Encoding
- A character encoding is a mapping between bits and glyphs
  - i.e., getting from bits in a file to characters on a screen
- Can be a major source of incompatibility
- ASCII is the basic character encoding scheme for English
  - encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes

Character Encoding
- Other languages can have many more glyphs
  - e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
- Many languages have multiple encoding schemes
  - e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
  - must specify encoding
  - can't have multiple languages in one file
- Unicode developed to address encoding problems
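The incompatibility problem is easy to demonstrate: the same text produces different byte sequences under different encodings, and decoding with the wrong one silently yields the wrong glyphs.

```python
# The same string under two encodings; until the encoding is known,
# the bytes alone are ambiguous.
text = "café"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(list(utf8_bytes))     # the accented character takes two bytes in UTF-8
print(list(latin1_bytes))   # ... but a single byte in Latin-1

# Decoding with the wrong encoding does not fail; it just mangles the text:
print(utf8_bytes.decode("latin-1"))
```

This is why a converter must determine or be told the encoding of each input file before normalizing everything to a single Unicode standard like UTF-8.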

Storing the Documents
- Many reasons to store converted document text
  - saves crawling time when a page is not updated
  - provides efficient access to text for snippet generation, information extraction, etc.
- Database systems can provide document storage for some applications
  - web search engines use customized document storage systems

Storing the Documents
- Requirements for document storage system:
  - Random access: request the content of a document based on its URL; a hash function based on the URL is typical
  - Compression and large files: reducing storage requirements and efficient access
  - Update: handling large volumes of new and modified documents; adding new anchor text
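URL-keyed random access can be sketched with a hash table of offsets into one large append-only buffer. The buffer stands in for a large on-disk file, and the use of MD5 as the URL hash is an illustrative choice; real stores add compression and update handling on top of this layout.

```python
import hashlib

class DocumentStore:
    """Sketch of URL-keyed random access: document text is appended to
    one large buffer (standing in for a big on-disk file), and a hash
    of the URL keys a table of (offset, length) pairs."""
    def __init__(self):
        self.data = bytearray()
        self.table = {}   # url hash -> (offset, length)

    def _key(self, url):
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def put(self, url, text):
        blob = text.encode("utf-8")
        self.table[self._key(url)] = (len(self.data), len(blob))
        self.data += blob

    def get(self, url):
        offset, length = self.table[self._key(url)]
        return self.data[offset:offset + length].decode("utf-8")

store = DocumentStore()
store.put("http://example.com/a", "first document")
store.put("http://example.com/b", "second document")
print(store.get("http://example.com/a"))
```

Looking up a document is one hash computation plus one seek-and-read, which is what components like the snippet generator need from the store at query time.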

Large Files
- Store many documents in large files, rather than each document in a file
  - avoids overhead in opening and closing files
  - reduces seek time relative to read time
- Compound document formats used to store multiple documents in a file
  - e.g., TREC Web, Wikipedia XML