A Survey on Web Information Retrieval Technologies
|
|
- Jodie Austin
- 6 years ago
- Views:
Transcription
1 A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University
2 Overview Web Information Retrieval Challenges Search Engine Overview and Architecture (Google as Case Study) Various Algorithms Directories Overview and some Algos. Estimating Size of Web Main Challenges Google File System Sponsored Search Discussion
3 Web Information Retrieval Challenges Bulk Dynamic Internet Heterogeneity Variety of Languages Duplication High Linkage Ill-formed queries Wide Variance in Users Specific Behavior
4 The Goal
5 Evaluation Precision - fraction of the documents retrieved that are relevant to the user's information need. Precision = (relevant documents AND retrieved documents)/retrieved documents Recall - fraction of the documents that are relevant to the query that are successfully retrieved. Recall = (relevant documents AND retrieved documents)/relevant documents
6 Current Goal Precision at Top 0 results & Precision at Top 0 result pages
7 Various SEs SEs Directories News SEs Meta Search Engines Social Search Engines Opinions, Forums, Usenets SEs
8 Architecture (Sherman 003) Crawler URL Indexer The Web URL3 Query Server URL Eggs? URL4 Eggs. All About Eggs - 90% Eggo - 8% Your Eggs EgoBrowser by40% Huh? 0% S. I. -Am
9 Crawling the web Robust should avoid overloading the websites and must deal with huge amounts of data. Decides in what order to crawl the pages. Frequency of revisiting pages Rule of Thumb Important pages first
10 Crawler
11 Cho etc.99 (spread the workload) Allocation that URL s in 500 Queues Allocation based on the Hash of the server name Read one URL from each queue at a time
12 Inverted Indexes the IR Way How Inverted Index Are Created???
13 Periodically rebuilt, static otherwise. Documents are parsed to extract tokens. These are saved with the Document ID. Doc Doc Now is the time It was a dark and for all good men stormy night in the country to come to the aid of their country manor. The time was past midnight Term now is the time for all good men to come to the aid of their country it was a dark and stormy night in the country manor the time was past midnight Doc #
14 After all documents have been parsed the inverted file is sorted alphabetically. Term now is the time for all good men to come to the aid of their country it was a dark and stormy night in the country manor the time was past midnight Doc # Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the the the their time time to to was was Doc #
15 Multiple term entries for a single document are merged. Withindocument term frequency information is compiled. Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the the the their time time to to was was Doc # Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the their time time to was Doc # Freq
16 How Inverted Files are Created Finally, the file can be split into Dictionary or Lexicon file Postings file
17 Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the their time time to was Doc # Freq Dictionary/Lexicon Postings Term a aid all and come country dark for good in is it manor men midnight night now of past stormy the their time to was N docs Doc # Tot Freq 4 Freq
18 Inverted indexes Permit fast search for individual terms For each term, you get a list consisting of: document ID frequency of term in doc (optional) position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d, d manor -> d country AND manor -> d
19 Inverted Indexes for Web Search Engines Inverted indexes are still used, even though the web is so huge. Some systems partition the indexes across different machines. Each machine handles different parts of the data. Other systems duplicate the data across many machines; queries are distributed among the machines. Most do a combination of these.
20 Google Architecture
21 Repository Contains the full HTML text Compressed using zlib (RFC950) Prefixed by docid, length, and URL Document Index Each entry contain The current doc status (crawled?) A pointer into the repository (if crawled) A document checksum (using binary search to find the docid ) Various statistics Lexicon Keep in memory on a 56M Current contains 4 million words
22 Google s Indexing The Indexer converts each doc into a collection of hit lists and puts these into barrels, sorted by docid. It also creates a database of links. Hit: <wordid, position in doc, font info, hit type> Hit type: Plain or fancy. Fancy hit: Occurs in URL, title, anchor text, metatag. Optimized representation of hits ( bytes each). Sorter sorts each barrel by wordid to create the inverted index. It also creates a lexicon file. Lexicon: <wordid, offset into inverted index> Lexicon is mostly cached in-memory
23 Google s Inverted Index Each barrel contains postings for a range of wordids. Lexicon (in-memory) Postings ( Inverted barrels, on disk) wordid #docs wordid #docs wordid #docs Sorted by wordid Sorted by Docid Docid Docid Docid Docid Docid #hits #hits #hits #hits #hits Hit, hit, hit, hit, hit Hit Hit, hit Hit Hit, hit, hit Barrel i Barrel i+
24 Google crawler Maintain its own DNS cache Asynchronous I/O to manage events 4 crawler Both URLserver & crawler are implement in Python Each crawler keeps 300 connections open at once >00 pages / s, roughly 600K/s
25 How is Relevance of the page decided???
26 Content Relevance Phrase matching. Synonyms. URL analysis. Date last updated. Spell checking. Home page detection.
27 HTML Weighting Class Name HTML tags ) Plain Text None of the above ) Strong STRONG, B, EM, I, U 3) List DL, OL, UL 4) Header H, H, H3, H4, H5, H6 5) Anchor A 6) Title TITLE Meta tag text is mostly ignored by search engines
28 Link-Based Metrics A link from A to B can be viewed as a recommendation, a vote or a citation. Links can be referential, or informational Links effect the ranking of web pages and thus have commercial value.
29 Citation and Linking
30 PageRank - Motivation The number incoming links to a page is a measure of importance and authority of the page. Also take into account the quality of recommendation, so a page is more important if the sources of its incomoing links are important.
31 The Random Surfer Assume the web is a Markov chain. Surfers randomly click on links, where the probability of an outlink from page A is /m, where m is the number of outlinks from A. The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page. Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
32 PageRank (PR) - Definition PR(Wn ) T PR(W ) PR(W ) PR(W ) = + ( T )( ) N O(W ) O(W ) O(Wn ) W is a web page Wi are the web pages that have a link to W O(Wi) is the number of outlinks from Wi T is the teleportation probability N is the size of the web
33 HITS Hubs and Authorities Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub
34 HITS Algorithm Iterate until Convergence A( p ) = H (q ) A(q ) q B q p H ( p) = q B p q B is the base set q and p are web pages in B A(p) is the authority score for p H(p) is the hub score for p
35 HITS Algorithm Overview Input: Q a query string SE a text-based search engine t size of the root set d max number of in links Top t pages (highest-ranked pages) from the text-based search engine form the root set Output: focused subset
36 ( Q = java, SE = AltaVista, t = 3, d = 3)
37 ( Q = java, SE = AltaVista, t = 3, d = 3)
38 Algorithm An Iterative Algorithm [authority] weights vector x0 = (,,,, ) [hub] weights vector y0 = (,,,, ) for i =,,, k xi = update_authorityw(yi-) yi = update_hubw(xi) normalize(xi, yi) return (xk, yk)
39
40
41
42
43
44
45
46
47
48
49
50 Applications of HITS Search engine querying (speed is an issue). Finding web communities. Finding related pages. Populating categories in web directories. Citation analysis.
51 HITS Did not Work Well!!! Mutually Reinforcing Relationship Between Hosts Automatically Generated Links Non-Relevant Node Topic drift approach K edges - /k authority weight L edges - /l hub weight
52 Duplicate Elimination Challenges Defining the notation of a replicated collection precisely Slight differences between copies Efficient algorithm to identify such collection and exploiting this knowledge of replication Hundreds of millions of pages Subgraph isomorphism: NP
53 Reasons for inability to Detect Duplicates Update Frequency Different Formats Partial Crawls
54 Some Solutions IR for Textual similarity Data Mining for clustering so on... Formal definition of similar collections follows: Similar Pages Similar Links
55 Growing Similar Clusters
56 Directories vs. Search Engines Directories Hand-selected sites Search over the contents of the descriptions of the pages Organized in advance into categories Search Engines All pages in all sites Search over the contents of the pages themselves Organized in response to a query by relevance rankings or other scores
57 Directories & Categorization Automatic Categorization -Taper A taxonomy-and-path-enhanced-retrieval system Given Hypertext document corpus A small set of classified documents Goal Construct a classifier Apply to new documents Manual Categorization -OpenGrid and ODP
58
59
60 Good discriminating power: large interclass distance, small intraclass distance
61 Size of the Web Typical Questions Which search engine has the largest coverage? How many pages are out there and how many are indexed? Approach - Measure search engine coverage and overlap through random queries - Allows a third party to measure relative sizes and overlaps of search engines - Take two search engines, E and E, we can: Compute their relative sizes Compute the fraction of E s database indexed by E
62 Size Wars August 005 : We index 0 billion documents. September 005 : We index 8 billion documents, but our index is 3 times larger than our competition s. So, who s right?
63 We knew the web was big... As of 0/30/008 0:33:00 7/5/008 0::00 AM PM 7/5/008 0::00 AM Claimed to have found: trillion (as in,000,000,000,000) unique URLs on the web at once!
64
65
66 Challenges Spam - Text Spam - Link Spam Cloaking Content Quality Quality Evaluation Web Conventions (META tag) Duplicate Hosts Vaguely Structured Data
67 Google File System Performance, Scalability, Reliability, Avalability Component failures are a norm Files are huge Most files are mutated by writing in the end rather than overwriting data Co-designing APIs with File System to increase flexibility
68 GFS
69 Sponsored Search More than 50% users visit a SE every few days Over 3% of traffic to commercial sites was generated by Ses Over 40% of product searches on web were initiated via Ses Pioneered by Goto, later Overture Google
70 Measurement and Pricing Cost per mille Cost per action Cost per click Click through rate = (000*CPM) + CPC Yahoo! - Uses CPC Google Uses CTR * CPC
71 Discussion Large field with great challenges Google really lived upto its name 0^00!!! Highly profitable Internet today is a Big Brain... Can we contribute???
Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search
CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationThe Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationAnatomy of a search engine. Design criteria of a search engine Architecture Data structures
Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection
More informationLogistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy
CSE 454 - Case Studies Indexing & Retrieval in Google Slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Design of Alta Vista Based on a talk by Mike Burrows Group Meetings Starting Tomorrow
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationWeb Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University
Web Search Basics Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction
More informationAdministrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454
Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationCOMP5331: Knowledge Discovery and Data Mining
COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationRunning Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.
Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,
More informationWeb Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University
Web Search Basics Berlin Chen Department t of Computer Science & Information Engineering i National Taiwan Normal University References: 1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationBrief (non-technical) history
Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationIndex Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search
Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationDefinition. Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
Web Crawlers Definition Spider = robot = crawler Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web. What is the Web? (another view) pages containing
More informationEinführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme
Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationLecture 8: Linkage algorithms and web search
Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationTODAY S LECTURE HYPERTEXT AND
LINK ANALYSIS TODAY S LECTURE HYPERTEXT AND LINKS We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationAn Introduction to Search Engines and Web Navigation
An Introduction to Search Engines and Web Navigation MARK LEVENE ADDISON-WESLEY Ал imprint of Pearson Education Harlow, England London New York Boston San Francisco Toronto Sydney Tokyo Singapore Hong
More informationSEARCH ENGINE INSIDE OUT
SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationBuilding Search Applications
Building Search Applications Lucene, LingPipe, and Gate Manu Konchady Mustru Publishing, Oakton, Virginia. Contents Preface ix 1 Information Overload 1 1.1 Information Sources 3 1.2 Information Management
More informationIntroduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction
Introduction to Information Retrieval and Anatomy of Google Information Retrieval Introduction Earlier we discussed methods for string matching Appropriate for small documents that fit in memory available
More informationSearching the Web [Arasu 01]
Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web
More informationRelevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search
Algoritmi per IR Web Search Goal of a Search Engine Retrieve docs that are relevant for the user query Doc: file word or pdf, web page, email, blog, e-book,... Query: paradigm bag of words Relevant?!?
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationLink Analysis in Web Mining
Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained
More informationFull-Text Indexing For Heritrix
Full-Text Indexing For Heritrix Project Advisor: Dr. Chris Pollett Committee Members: Dr. Mark Stamp Dr. Jeffrey Smith Darshan Karia CS298 Master s Project Writing 1 2 Agenda Introduction Heritrix Design
More informationIndexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table
Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationLink Analysis and Web Search
Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html
More informationAn Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia
An Overview of Search Engine Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia haixu@microsoft.com July 24, 2007 1 Outline History of Search Engine Difference Between Software and
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationWeb Structure Mining using Link Analysis Algorithms
Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.
More informationBing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web
More informationF. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google
Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm
More informationCS6200 Information Retreival. The WebGraph. July 13, 2015
CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects
More informationDistributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationInformation retrieval
Information retrieval Lecture 8 Special thanks to Andrei Broder, IBM Krishna Bharat, Google for sharing some of the slides to follow. Top Online Activities (Jupiter Communications, 2000) Email 96% Web
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationSearch Engines Information Retrieval in Practice
Search Engines Information Retrieval in Practice W. BRUCE CROFT University of Massachusetts, Amherst DONALD METZLER Yahoo! Research TREVOR STROHMAN Google Inc. ----- PEARSON Boston Columbus Indianapolis
More information10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues
COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization
More informationLIST OF ACRONYMS & ABBREVIATIONS
LIST OF ACRONYMS & ABBREVIATIONS ARPA CBFSE CBR CS CSE FiPRA GUI HITS HTML HTTP HyPRA NoRPRA ODP PR RBSE RS SE TF-IDF UI URI URL W3 W3C WePRA WP WWW Alpha Page Rank Algorithm Context based Focused Search
More informationEECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling
EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report
More informationRoadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases
Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey
More informationInformation Retrieval. Lecture 11 - Link analysis
Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks
More informationDesktop Crawls. Document Feeds. Document Feeds. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationCS6200 Information Retreival. Crawling. June 10, 2015
CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on
More informationRanking of ads. Sponsored Search
Sponsored Search Ranking of ads Goto model: Rank according to how much advertiser pays Current model: Balance auction price and relevance Irrelevant ads (few click-throughs) Decrease opportunities for
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationLink Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld
Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1 Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on
More informationCS47300 Web Information Search and Management
CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page
More informationRecent Researches on Web Page Ranking
Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through
More informationAmbiguity Resolution in Search Engine Using Natural Language Web Application.
Ambiguity Resolution in Search Engine Using Natural Language Web Application. Azeez Nureni Ayofe *1, Azeez Raheem Ajetola 1, and Ade Stanley Oyewole 2 1 Department of Maths and Computer Science, College
More informationChapter 2: Literature Review
Chapter 2: Literature Review 2.1 Introduction Literature review provides knowledge, understanding and familiarity of the research field undertaken. It is a critical study of related reviews from various
More informationAgenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page
Agenda Math 104 1 Google PageRank algorithm 2 Developing a formula for ranking web pages 3 Interpretation 4 Computing the score of each page Google: background Mid nineties: many search engines often times
More informationMeasuring Similarity to Detect
Measuring Similarity to Detect Qualified Links Xiaoguang Qi, Lan Nie, and Brian D. Davison Dept. of Computer Science & Engineering Lehigh University Introduction Approach Experiments Discussion & Conclusion
More informationPart 1: Link Analysis & Page Rank
Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,
More informationChrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO
Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO INDEX Proposal Recap Implementation Evaluation Future Works Proposal Recap Keyword Visualizer (chrome
More information~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~
. Search Engines, history and different types In the beginning there was Archie (990, indexed computer files) and Gopher (99, indexed plain text documents). Lycos (994) and AltaVista (995) were amongst
More informationCRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA
CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA An Implementation Amit Chawla 11/M.Tech/01, CSE Department Sat Priya Group of Institutions, Rohtak (Haryana), INDIA anshmahi@gmail.com
More informationBruno Martins. 1 st Semester 2012/2013
Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4
More informationInformation Retrieval. Lecture 9 - Web search basics
Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general
More informationPagerank Scoring. Imagine a browser doing a random walk on web pages:
Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably
More informationPAGE RANK ON MAP- REDUCE PARADIGM
PAGE RANK ON MAP- REDUCE PARADIGM Group 24 Nagaraju Y Thulasi Ram Naidu P Dhanush Chalasani Agenda Page Rank - introduction An example Page Rank in Map-reduce framework Dataset Description Work flow Modules.
More informationWeb Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono
Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management
More informationPlan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis
CS276B Text Retrieval and Mining Winter 2005 Lecture 7 Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling
More informationInformation Retrieval. Lecture 4: Search engines and linkage algorithms
Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2
More informationExam IST 441 Spring 2011
Exam IST 441 Spring 2011 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant
More informationINTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)
INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)
More informationInformation Retrieval II
Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning
More informationDetecting Spam Web Pages
Detecting Spam Web Pages Marc Najork Microsoft Research Silicon Valley About me 1989-1993: UIUC (home of NCSA Mosaic) 1993-2001: Digital Equipment/Compaq Started working on web search in 1997 Mercator
More informationIndexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems
Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next
More informationWeb Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono
Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management
More information