Information Retrieval. Lecture 9 - Web search basics
- Edward Ramsey
Transcription
1 Information Retrieval. Lecture 9 - Web search basics
Seminar für Sprachwissenschaft, International Studies in Computational Linguistics, Wintersemester
2 Introduction
Up to now: techniques for general Information Retrieval (building and compressing an index, querying, expressing relevance feedback)
Today: a specific type of Information Retrieval system, web search engines
- World Wide Web ≠ standard IR collections (e.g. newswires)
- Users of web search engines ≠ users of standard IR systems
3 Overview
- Background and history
- Anatomy of the world wide web
- Web search users
- Web search systems
- Estimating the size of the index
- Duplicates and near-duplicates
4 Background and history
Complexity of web search comes from:
- its scale (about 20 billion pages currently)
- its lack of coordination (decentralized content publishing)
- the heterogeneity of its contributors (motives and backgrounds)
Success of the WWW comes from:
- an easy-to-learn authoring language (HTML)
- robust browsers (unknown code is ignored)
- access to the source code (learn by example)
5 Background and history: Early web search engines and web collections
2 families of engines:
- (1) full-text index-based search engines (altavista, excite, infoseek)
- (2) taxonomy-based search engines (yahoo)
(2) relies on a classification of documents that may be unintuitive, and has a high maintenance cost
Early collections:
- tens of millions of pages (larger than any prior collection)
- indexing and fast querying were performed successfully, but without the expected quality of retrieval
- new techniques were needed to rank the retrieved pages and to deal with spam
6 Background and history: Indexing the web
Questions arising when one wants to index the web:
- Which pages can one trust?
- How can a search engine assign a measure of trust to a webpage?
- How to deal with the expansion of the collection? (*)
- How to deal with redundancy?
(*) By the end of 1995, altavista had crawled 30 million static webpages (the size of the index doubled every few months)
8 Anatomy of the world wide web: Web as a graph
The web can be represented as a graph:
- webpage ↔ node
- hyperlink ↔ directed edge
- number of in-links ↔ in-degree of a node
- number of out-links ↔ out-degree of a node
In-links follow a power law: the number of webpages with in-degree $i$ is proportional to $1/i^{\alpha}$, with $\alpha \approx 2.1$
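The graph view above can be illustrated with a minimal Python sketch. The edge list and the function names `degree_counts` and `in_degree_histogram` are made up for this example; they are not part of the lecture. The histogram is the quantity whose shape the power law describes on real web data.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for (source, target) hyperlinks."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1   # one more out-link for the linking page
        in_degree[target] += 1    # one more in-link for the linked page
    return in_degree, out_degree

def in_degree_histogram(edges):
    """Map each in-degree i >= 1 to the number of pages having that in-degree."""
    in_degree, _ = degree_counts(edges)
    return Counter(in_degree.values())

# Hypothetical toy web of four pages (illustration only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
in_deg, out_deg = degree_counts(edges)
print(in_deg["C"], out_deg["A"])
print(in_degree_histogram(edges))
```

Pages that nobody links to (here `D`) never appear as a target, so they are absent from the histogram; a toy graph this small is of course far too small to exhibit a power law.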
9 Anatomy of the world wide web: Web as a graph (continued)
[Figure: example graph with nodes A, B, C, D, E, F, G, J, K]
10 Anatomy of the world wide web: Web as a graph (continued)
The web has a bowtie shape in which webpages belong to one of these 3 categories:
- IN (out-degree > 0)
- OUT (in-degree > 0)
- SCC (strongly connected component)
There are no hyperlinks from OUT to SCC, nor from SCC to IN
11 Anatomy of the world wide web: Web as a graph (continued)
[Figure from (Manning et al., 2008)]
12 Anatomy of the world wide web: Dealing with spam
- The web is considered a medium to connect advertisers to prospective buyers
- Sponsored search engines (e.g. Goto, using bids for queries)
- 1st generation of spam: building documents with specific high-frequency terms, in order to appear first in the retrieval for some queries
- Cloaking: is the client a crawler? Yes → misleading document; No → spam
- A doorway document is used to get highly ranked, but when accessed by a browser, it redirects the user to spam
- Current solution: link analysis (more later)
14 Web search users
Improving retrieval results requires a better understanding of how the search engine is used:
- users do not know (or care) about the heterogeneity of content
- users do not know (or care) about the querying syntax
- users use on average between 2 and 3 keywords
Reaching a bigger audience also requires a better understanding of how the search engine is used (cf. revenue from sponsored search)
The Google example:
- (1) focus on relevance and precision (rather than recall)
- (2) lightweight user experience (clean input and output graphical interface)
15 Web search users: User query needs
3 categories of common search queries:
- informational (general information about a broad topic)
- navigational (a particular webpage, precision at 1)
- transactional (prelude to a transaction, such as a purchase, a download, etc.)
The category of the query should have an impact on the algorithmic search
16 Web search systems
[Figure from (Manning et al., 2008)]
18 Estimating the size of the index
Question: given 2 web search engines, what are the relative sizes of their indexes?
- what about documents that are retrieved while not fully indexed?
- what about a partitioned index (only a proportion of the total index is used for retrieval)?
There is no simple measure of the size of the index
19 Estimating the size of the index (continued)
Capture-recapture method: is a page indexed by $E_1$ also indexed by $E_2$, and vice versa?
- $x$: proportion of pages in $E_1$ that are in $E_2$
- $y$: proportion of pages in $E_2$ that are in $E_1$
and thus $x \cdot |E_1| \approx y \cdot |E_2|$, i.e. $\frac{|E_1|}{|E_2|} \approx \frac{y}{x}$
How to select the sample of indexed webpages?
- queries should not come from a specific group of users
- webpages should not be hosted by a specific machine (IP sharing)
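The capture-recapture estimate can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the function name `relative_index_size`, the membership tests `in_e1` and `in_e2`, and the toy indexes are hypothetical, and the samples are assumed to be unbiased, which is exactly the hard part in practice.

```python
def relative_index_size(sample_from_e1, sample_from_e2, in_e2, in_e1):
    """Estimate |E1| / |E2| as y / x from two page samples.

    in_e2(page) and in_e1(page) are caller-supplied membership tests
    (hypothetical here; a real test would query the other engine).
    """
    # x: proportion of sampled E1 pages that are also in E2
    x = sum(1 for page in sample_from_e1 if in_e2(page)) / len(sample_from_e1)
    # y: proportion of sampled E2 pages that are also in E1
    y = sum(1 for page in sample_from_e2 if in_e1(page)) / len(sample_from_e2)
    if x == 0:
        raise ValueError("no sampled E1 page was found in E2; ratio undefined")
    return y / x

# Toy indexes (illustration only): E1 holds 100 pages, E2 holds 50 of them.
index1 = set(range(0, 100))
index2 = set(range(50, 100))
ratio = relative_index_size(
    sorted(index1), sorted(index2),
    in_e2=lambda page: page in index2,
    in_e1=lambda page: page in index1,
)
print(ratio)
```

Here the whole index is used as the sample, so x = 0.5 and y = 1.0, and the estimate matches the true ratio of 2; with real engines only a small sample is available and the estimate carries sampling error.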
20 Estimating the size of the index: Selecting random queries
Idea: pick a page at random from a search engine's index by posing a random query to it (keywords chosen at random)
In practice:
a) a random conjunctive query is applied to $E_1$ → top 100 documents
b) from these 100 documents, a page $p$ is selected at random
c) is $p$ indexed by $E_2$? → select 6-8 low-frequency terms in $p$ for querying $E_2$
Nonetheless, there is bias from the length of documents and from the ranking algorithm used
Solution: statistical sampling (evaluates the magnitude of the bias)
22 Duplicates and near-duplicates
40 % of the web is estimated to consist of duplicates (e.g. mirroring for access reliability)
To find duplicates:
- a fingerprint is computed (digest of the term sequence)
- when fingerprints match, the documents are compared (and in case of a duplicate, one document is removed from the index)
To find near-duplicates, shingling is used:
- the k-shingles of a document d are the set of all consecutive sequences of k terms in d
- documents are near-duplicates when they share many shingles
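Both ideas fit in a short Python sketch. The choices below are assumptions made for illustration, not the lecture's method: whitespace tokenization, lowercasing, SHA-1 as the digest, and the function names `fingerprint` and `shingles`.

```python
import hashlib

def fingerprint(text):
    """Digest of the term sequence (assumed: lowercased, whitespace-split terms)."""
    terms = text.lower().split()
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()

def shingles(text, k=4):
    """Set of all consecutive sequences of k terms in the text."""
    terms = text.lower().split()
    # A document shorter than k terms yields an empty set.
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

# Exact duplicates up to case and spacing share a fingerprint.
print(fingerprint("A rose is a rose") == fingerprint("a  rose is a   rose"))

# 4-shingles of a short text: repeated sequences collapse in the set.
for shingle in sorted(shingles("a rose is a rose is a rose", k=4)):
    print(shingle)
```

The text "a rose is a rose is a rose" has five 4-term windows but only three distinct 4-shingles, because two windows repeat.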
23 Duplicates and near-duplicates: Finding near-duplicates
Let $S(d_j)$ be the set of shingles for document $d_j$
$J(S(d_i), S(d_j)) = \frac{|S(d_i) \cap S(d_j)|}{|S(d_i) \cup S(d_j)|}$
If $J(S(d_i), S(d_j)) \geq 0.9$, $d_j$ is not indexed
The cost of computing pairwise comparisons is too high
How to estimate this shingle-sharing (i.e. reduce the comparison cost)?
Estimation using a hash function producing a 64-bit integer, applied to the shingles
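The Jaccard test itself is simple to write down; the cost problem comes from applying it to every pair of documents. A minimal sketch, where the function names and the integer "shingles" are placeholders chosen for illustration:

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)

def is_near_duplicate(shingles_i, shingles_j, threshold=0.9):
    """Apply the 0.9 threshold from the slide to two shingle sets."""
    return jaccard(shingles_i, shingles_j) >= threshold

# Integers stand in for shingles (illustration only).
s1, s2 = set(range(20)), set(range(19))     # overlap 19, union 20
s3, s4 = set(range(10)), set(range(1, 11))  # overlap 9, union 11
print(jaccard(s1, s2), is_near_duplicate(s1, s2))
print(jaccard(s3, s4), is_near_duplicate(s3, s4))
```

The first pair reaches 19/20 = 0.95 and passes the threshold; the second reaches only 9/11 and does not.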
24 Duplicates and near-duplicates: Finding near-duplicates (continued)
Let us define $H(d_j) = \{hash(x) \mid x \in S(d_j)\}$
We look for pairs $(d_i, d_j)$ such that $H(d_i)$ and $H(d_j)$ have large overlaps
Let $\pi$ be a random permutation of the 64-bit integers
Let $\Pi(d_j)$ be the set of permuted hash values in $H(d_j)$
Let $x^{\pi}_j$ be the smallest integer in $\Pi(d_j)$
Theorem: $J(S(d_i), S(d_j)) = P[x^{\pi}_i = x^{\pi}_j]$
25 Duplicates and near-duplicates: Finding near-duplicates (continued)
[Figure from (Manning et al., 2008)]
26 Duplicates and near-duplicates: Finding near-duplicates (continued)
Intuition: consider the matrix $A$ where rows are elements $i$ (e.g. hash values) and columns are sets $S_j$ (e.g. $H(d_x)$)
$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}$
$a_{ij} = 1$ iff element $i$ is in $S_j$
$\Pi$ is a random permutation of the rows of $A$
$\Pi(S_j)$ is the permuted column $j$
$x^{\pi}_j$ is the index of the first row in which $\Pi(S_j)$ has a 1
For two columns $j_m$, $j_n$: $P[x^{\pi}_{j_m} = x^{\pi}_{j_n}] = J(S_{j_m}, S_{j_n})$
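27 Duplicates and near-duplicates: Finding near-duplicates (continued)
Considering two columns $S_{j_m}$ and $S_{j_n}$, filled with 1 and 0, let $C_{ab}$ be the number of rows with value $a$ in $S_{j_m}$ and value $b$ in $S_{j_n}$:
$J(S_{j_m}, S_{j_n}) = \frac{C_{11}}{C_{01} + C_{10} + C_{11}}$
This fraction is also $P[x^{\pi}_{j_m} = x^{\pi}_{j_n}]$ (the probability that the first row that is not 0 0, found during top-down row-scanning, is a 1 1 row)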
28 Duplicates and near-duplicates: Finding near-duplicates (continued)
To sum up:
- the test for Jaccard overlap is based on the permutation $\Pi$
- the values $x^{\pi}_i$ are computed for the different documents; when $x^{\pi}_i = x^{\pi}_j$, this is evidence that documents $i$ and $j$ are near-duplicates
- not all permutations are computed, only a sketch $\psi_i$ of 200 permutations
- $\frac{|\psi_i \cap \psi_j|}{200} \geq$ threshold means that documents $i$ and $j$ are near-duplicates
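The sketch-based estimate can be tried out with a small Python sketch. Several things here are assumptions rather than the lecture's construction: true random permutations of the 64-bit integers are replaced by random linear functions modulo a prime (a common approximation), BLAKE2b supplies the 64-bit shingle hash, sketches are compared position by position (the fraction of permutations on which the two minima agree), and all names are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as modulus (an assumption of this sketch)

def hash64(shingle):
    """Deterministic 64-bit hash of a shingle given as a tuple of terms."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def make_permutations(count=200, seed=0):
    """Draw (a, b) pairs; x -> (a*x + b) mod PRIME stands in for a permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]

def sketch(shingle_set, permutations):
    """Smallest permuted hash value per permutation (shingle_set must be non-empty)."""
    hashes = [hash64(s) for s in shingle_set]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in permutations]

def estimated_jaccard(sketch_i, sketch_j):
    """Fraction of permutations on which the two minima agree."""
    matches = sum(1 for x, y in zip(sketch_i, sketch_j) if x == y)
    return matches / len(sketch_i)

# Toy documents as sets of 1-term shingles; the true Jaccard is 100 / 200 = 0.5.
doc_a = {("t%d" % i,) for i in range(150)}
doc_b = {("t%d" % i,) for i in range(50, 200)}
perms = make_permutations()
estimate = estimated_jaccard(sketch(doc_a, perms), sketch(doc_b, perms))
print(estimate)
```

Each document is reduced to 200 integers, so a pair is compared with 200 equality checks regardless of document length; the printed estimate fluctuates around the true value of 0.5, with more permutations giving a tighter estimate.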
29 Conclusion
- Introduction to the WWW (history, shape, users)
- Estimate of the size of the index of a web search engine
- Techniques to remove duplicates (fingerprints) and near-duplicates (shingles)
To come:
- web crawling
- link analysis (Google PageRank)
30 References
- C. Manning, P. Raghavan and H. Schütze. Introduction to Information Retrieval (2008). chapter19-webchar.pdf
- Ziv Bar-Yossef and Maxim Gurevich. Random Sampling from a Search Engine's Index (2006).
- Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener. Graph structure in the web (2000).
An introduction to Web Mining part II Ricardo Baeza-Yates, Aristides Gionis Yahoo! Research Barcelona, Spain & Santiago, Chile ECML/PKDD 2008 Antwerp Yahoo! Research Agenda Statistical methods: the size
More informationLecture #3: PageRank Algorithm The Mathematics of Google Search
Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationRecent Researches on Web Page Ranking
Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationSEO 1 8 O C T O B E R 1 7
SEO 1 8 O C T O B E R 1 7 Search Engine Optimisation (SEO) Search engines Search Engine Market Global Search Engine Market Share June 2017 90.00% 80.00% 79.29% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00%
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Center for Information and Language Processing, University of Munich 2009.07.14 1/36 Outline 1 Recap
More informationAdministrative. Web crawlers. Web Crawlers and Link Analysis!
Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt
More informationAdministrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454
Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationModule 1: Internet Basics for Web Development (II)
INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of
More informationRandom Sampling from a Search Engine s Index
Random Sampling from a Search Engine s Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion 1 Search Engine Samplers Search Engine Web Queries Public Interface Sampler Top
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationMathematical Analysis of Google PageRank
INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched
More information