World Wide Web has specific challenges and opportunities

Size: px
Start display at page:

Download "World Wide Web has specific challenges and opportunities"

Transcription

1 6. Web Search

2 Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has specific challenges and opportunities it is decentralized, i.e., there is no central repository from which we can obtain all available web pages web pages are connected via hyperlinks, which provides us with additional information, potentially including anchor texts search engines can monitor the behavior of their users (e.g., which results they click on) and use this to improve the effectiveness of search results 2

3 A Brief History of Web Search 1990s: Early web search engines (e.g., Excite and Altavista) mostly relied on traditional methods from Information Retrieval (e.g., tf.idf-based models that only looked at text in web pages) 1998: Google, as a project out of Stanford University, with its PageRank algorithm makes clever use of the Web s link structure to estimate the importance of web pages since mid-2000s: search engines increasingly rely on learningto-rank methods that rely on machine learning to rank result documents based on observed user behavior and taking into account a large number of signals 3

4 Agenda 6.1 The Web as a Graph 6.2 Crawling the Web 6.3 Link Analysis 6.4 Web Spam 6.5 Learning-to-Rank 4

5 6.1 The Web as a Graph The World Wide Web can be seen as a directed graph G(V, E) web pages (URLs) as vertices V (nodes) hyperlinks as directed edges E Adjacency Matrix: A = S T W X U V

6 Degree Distributions Researchers have investigated different structural properties of the web graph including its degree distributions Distributions of indegrees and outdegrees follow power laws P[ = k ] Ã k with α 2.1 (indegrees) and α 2.7 (outdegrees) 6

7 Bow-Tie Structure Connectivity structure of the web graph resembles a bow tie large strongly connected component (SCC) of web pages that are reachable from each other smaller sets IN and OUT from which web pages in the SCC can be reached or that are reachable from SCC 7

8 Small Diameter The diameter of a graph is defined as the maximal length of any shortest path between two nodes a small diameter means that we can get from any other node in few steps The web graph has been observed to have a small diameter within its SCC, i.e., we can get from any web page to any other web page by following few hyperlinks Small diameters were first observed in the analysis of social networks and are known as the small world phenomenon 8

9 Small-World Phenomenon Small-world phenomenon was first mentioned in a famous experiment conducted by Stanley Milgram in 1967 participants of the experiment had to forward a package to a target person located in Boston and were only allowed to pass on the package to a person known to them on average, it took about six steps (i.e., passes) to route the packet to the target person This result is famously known as six degrees of separation, alluding to the fact that everybody seems connected to anybody else via a few steps 9

10 Random Graph Models Erdös-Renyi model G(n, p) randomly generates a graph having n nodes, in which each possible edge exists with probability p Barabási-Albert model randomly generates a graph by iteratively adding nodes, each of which has m outgoing edges; the target node of such an edge is chosen with probability depending on its current indegree (i.e., the rich get richer ) Barabási-Albert model yields graphs in which the distribution of indegrees follows a power law (i.e., similar to web graph) 10

11 6.2 Crawling the Web World Wide Web is inherently decentralized, i.e., web pages are stored on geographically distributed web servers, and there is no central repository of all available URLs When building a search engine, one thus has to first discover and download web pages to be indexed Search engines employ so-called crawlers (also: spiders) to discover and download web pages; a crawler starts from a set of seed pages and follows hyperlinks therein to discover other web pages and traverse the Web 11

12 Crawling Illustrated D B A E F Frontier : B, C C G Frontier new PriorityQueue() Frontier.add(Seeds) // insert seeds while!frontier.isempty() URL Frontier.Remove() // get next URL URL.fetch() // retrieve URL s contents Frontier.add(URL.getLinks()) // enqueue links 12

13 Challenges Politeness a crawler may only request permitted resources (in robots.txt) requests to the same web server have to be issued at a moderate rate (e.g., 1 request/second) to avoid overloading it Robustness crawler traps (e.g., dynamically generated calendars) incomplete/invalid HTML documents network problems (e.g., high latency, low bandwidth, temporary unavailability of web servers) 13

14 Robots Exclusion Standard (robots.txt) Robots Exclusion Standard demands that crawlers first request the file robots.txt, in which the content provider can specify which crawlers can access which contents at which rate User-agent: WebReaper Disallow: / User-agent: Slurp Crawl-delay: 18 User-agent: * Disallow: /active/ Disallow: /artikelversand/ Disallow: /cgi-bin/ Disallow: /staticgen/mobil/ 14

15 6.3 Link Analysis The link structure of the World Wide Web can be analyzed to obtain information about the importance of web pages PageRank, as used in the original Google search engine, is based on the intuition that a web page is important, if other important web pages contain hyperlinks to it PageRank is based on a random walk on the web graph, which can be formally described and analyzed as a so-called Markov chain 15

16 PageRank PageRank is based on a random walk on the web graph: in each step, a random surfer chooses between two options with probability ε the random surfer performs a random jump to any of the V nodes (web page) in the web graph with probability (1 ε) the random surfer follows on of the outgoing edges (hyperlinks) of the currently visited node The PageRank p(v ) of the node v is recursively defined as p(v) =(1 ) ÿ (u,v)œe p(u) out(u) + V and reflects the importance of the node v 16

17 PageRank PageRank values can be computed based on a transition probability matrix P, whose entries reflect with which probability the random surfer moves from one node to another node I (1 )/out(u)+ / V : (u, v) œ E P (u,v) = / V : P = S T W X U V =0.2 17

18 PageRank PageRank are computed using the power iteration method start with an vector of initial state probabilities π (0) fi (0) = # 1/ V... 1/ V $ compute state probabilities after one step of the random surfer as fi (1) = fi (0) P compute state probabilities after i steps of the random surfer as fi (i) = fi (i 1) P terminate computation once state probabilities have converged 18

19 PageRank Power iteration method applied to our example graph fi (0) = # $ fi (1) = # $ fi (2) = # $ fi (10) = # $ P = S T W X U V

20 6.4 Web Spam Many web sites have commercial interests and hope to attract more visitors from search engines Search Engine Optimization (SEO) seeks to make web pages more findable (e.g., by optimizing their meta data) Web spam refers to a family of techniques that seek to bring up web pages in search results for specific queries (e.g., by manipulating their contents) 20

21 Web Spam Gyöngyi and Garcia-Molina distinguish three kinds of web spam term spam manipulates the contents of web pages link spam manipulates the link structure of the web graph content hiding hides the actual content of web pages Web spam techniques evolved in parallel to search engines in a kind of arms race: initially: mostly content manipulation, targeting tf.idf-style methods then: manipulation of web graph, targeting link analysis methods more recently: manipulation of social media (e.g., comments, likes) to affect relevant signals in learning-to-rank methods 21

22 Term Spam Assumption: Search engines relies on a retrieval model that considers the term frequency Idea: Augment content of web pages with occurrences of terms, for which it should be returned in search results Our charming hotel close to Oslo (Norway) Our charming hotel close to Oslo (Norway) hotel hotel hotel norway norway norway fjord fjord cheap cheap cheap To avoid irritating users, additional term occurrences are added, so that they are not visible (e.g., same color as background) 22

23 Link Spam Assumption: Search engines relies on a link analysis method (e.g., PageRank) to estimate the importance of web pages Idea: manipulate other important web pages (e.g., DMOZ or Wikipedia), so that they point to own web pages

24 Link Spam Honey pots are collections of actually useful web pages (e.g., a copy of Wikipedia) modified to point to own web pages Link directories (e.g., DMOZ) can be manipulated to include hyperlinks to own web pages Comments in forums or social media sites can be created to include hyperlinks to own web pages Spam farms are collections of different web sites constructed solely to make own web pages appear more important 24

25 6.5 Learning-to-Rank Nowadays, search engines rely on a multitude of different signals to rank documents in response to a query, e.g.: textual relevance (e.g., based on a retrieval model) link-based importance (e.g., based on PageRank) spam probability (e.g., estimated using a classifier) user popularity (e.g., estimated on observed clicks) textual quality (e.g., based on typos in content) readability (e.g., based on sentence length) age (e.g., based on date of last modification) 25

26 Learning-to-Rank Search engines need to combine these different signals in a meaningful manner, so that search results are effective Machine learning methods can learn how to combine these signals in an effective manner based on observed user behavior This can be done, for example, by casting it into a classification problem and trying to predict whether a user will click or not click on a specific document for a specific query 26

27 Learning-to-Rank # $ R # $ N # $ N # $ R Training data with classes non-relevant (N) relevant (R) Classifier # ? $ Query time: Class R/N is determined for a previously unseen document and query 27

28 Learning-to-Rank Training data for classifier is obtained, e.g.: based on relevance judgments provided by assessors based on observed user behavior (e.g., does the user click on a document for a specific query or not) When processing a query, a common approach today is to identify the top-1000 documents using a base ranker (e.g., retrieval model combined with link analysis), which are then re-ranked using the learned classifier, taking all available signals into account 28

29 Summary World Wide Web can be seen as a directed graph, which has a small diameter and degrees whose distributions follow power laws Decentralized nature of World Wide Web requires that documents have to be first collected using a crawler PageRank as a link-analysis methods that estimates the importance of web pages based on the web graphs Learning-to-rank as a common approach to combine multiple signals to yield effective search results 29

30 Literature [1] C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008 (Chapter 20 & 21) [2] W. B. Croft, D. Metzler, and T. Strohman: Search Engines Information Retrieval in Practice, Pearson Education, 2009 (Chapter 3 & 4) 30

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

CS47300 Web Information Search and Management

CS47300 Web Information Search and Management CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans. 1 After WWW protocol was introduced in Internet in the early 1990s and the number of web servers started to grow, the first technology that appeared to be able to locate them were Internet listings, also

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

University of Maryland. Tuesday, March 2, 2010

University of Maryland. Tuesday, March 2, 2010 Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

An Introduction to Search Engines and Web Navigation

An Introduction to Search Engines and Web Navigation An Introduction to Search Engines and Web Navigation MARK LEVENE ADDISON-WESLEY Ал imprint of Pearson Education Harlow, England London New York Boston San Francisco Toronto Sydney Tokyo Singapore Hong

More information

Lec 8: Adaptive Information Retrieval 2

Lec 8: Adaptive Information Retrieval 2 Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/ Linear Algebra Revision Vectors:

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users

More information

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in

More information

SEOHUNK INTERNATIONAL D-62, Basundhara Apt., Naharkanta, Hanspal, Bhubaneswar, India

SEOHUNK INTERNATIONAL D-62, Basundhara Apt., Naharkanta, Hanspal, Bhubaneswar, India SEOHUNK INTERNATIONAL D-62, Basundhara Apt., Naharkanta, Hanspal, Bhubaneswar, India 752101. p: 305-403-9683 w: www.seohunkinternational.com e: info@seohunkinternational.com DOMAIN INFORMATION: S No. Details

More information

Search Engines. Charles Severance

Search Engines. Charles Severance Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Social Networks 2015 Lecture 10: The structure of the web and link analysis 04198250 Social Networks 2015 Lecture 10: The structure of the web and link analysis The structure of the web Information networks Nodes: pieces of information Links: different relations between information

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris Manning at Stanford U.) The Web as a Directed Graph

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 17 September 2018 Some slides courtesy Manning, Raghavan, and Schütze Other characteristics Significant duplication Syntactic

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

Information Retrieval. Lecture 10 - Web crawling

Information Retrieval. Lecture 10 - Web crawling Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis CS276B Text Retrieval and Mining Winter 2005 Lecture 7 Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena CSE 190 Lecture 16 Data Mining and Predictive Analytics Small-world phenomena Another famous study Stanley Milgram wanted to test the (already popular) hypothesis that people in social networks are separated

More information

The PageRank Citation Ranking: Bringing Order to the Web

The PageRank Citation Ranking: Bringing Order to the Web The PageRank Citation Ranking: Bringing Order to the Web Marlon Dias msdias@dcc.ufmg.br Information Retrieval DCC/UFMG - 2017 Introduction Paper: The PageRank Citation Ranking: Bringing Order to the Web,

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Link Analysis in the Cloud

Link Analysis in the Cloud Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)

More information

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~ . Search Engines, history and different types In the beginning there was Archie (990, indexed computer files) and Gopher (99, indexed plain text documents). Lycos (994) and AltaVista (995) were amongst

More information

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL Web mining - Outline Introduction Web Content Mining Web usage

More information

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

Graph and Link Mining

Graph and Link Mining Graph and Link Mining Graphs - Basics A graph is a powerful abstraction for modeling entities and their pairwise relationships. G = (V,E) Set of nodes V = v,, v 5 Set of edges E = { v, v 2, v 4, v 5 }

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS Overview of Networks Instructor: Yizhou Sun yzsun@cs.ucla.edu January 10, 2017 Overview of Information Network Analysis Network Representation Network

More information

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page Agenda Math 104 1 Google PageRank algorithm 2 Developing a formula for ranking web pages 3 Interpretation 4 Computing the score of each page Google: background Mid nineties: many search engines often times

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo Social and Technological Network Data Analytics Lecture 5: Structure of the Web, Search and Power Laws Prof Cecilia Mascolo In This Lecture We describe power law networks and their properties and show

More information

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

Introduction To Graphs and Networks. Fall 2013 Carola Wenk Introduction To Graphs and Networks Fall 203 Carola Wenk On the Internet, links are essentially weighted by factors such as transit time, or cost. The goal is to find the shortest path from one node to

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0 Website Name Project Code: #10001 Version: 1.0 DocID: SEO/site/rec Issue Date: DD-MM-YYYY Prepared By: - Owned By: Rave Infosys Reviewed By: - Approved By: - 3111 N University Dr. #604 Coral Springs FL

More information

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Crawling CE-324: Modern Information Retrieval Sharif University of Technology Crawling CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Sec. 20.2 Basic

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin. 12. Web Spidering These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin. 1 Web Search Web Spider Document corpus Query String IR System 1. Page1 2. Page2

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Agenda. 1 Web search. 2 Web search engines. 3 Web robots, crawler. 4 Focused Web crawling. 5 Web search vs Browsing. 6 Privacy, Filter bubble

Agenda. 1 Web search. 2 Web search engines. 3 Web robots, crawler. 4 Focused Web crawling. 5 Web search vs Browsing. 6 Privacy, Filter bubble Agenda EITF25 Internet - Web Search Anders Ardö EIT Electrical and Information Technology, Lund University November 28, 2013 A. Ardö, EIT EITF25 Internet - Web Search November 28, 2013 1 / 47 A. Ardö,

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA An Implementation Amit Chawla 11/M.Tech/01, CSE Department Sat Priya Group of Institutions, Rohtak (Haryana), INDIA anshmahi@gmail.com

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Detecting Spam Web Pages

Detecting Spam Web Pages Detecting Spam Web Pages Marc Najork Microsoft Research Silicon Valley About me 1989-1993: UIUC (home of NCSA Mosaic) 1993-2001: Digital Equipment/Compaq Started working on web search in 1997 Mercator

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates

More information

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)! Lecture 11: Graph algorithms!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the scenes of MapReduce:

More information

How Does a Search Engine Work? Part 1

How Does a Search Engine Work? Part 1 How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

COMMUNICATIONS METRICS, WEB ANALYTICS & DATA MINING

COMMUNICATIONS METRICS, WEB ANALYTICS & DATA MINING Dipartimento di Scienze Umane COMMUNICATIONS METRICS, WEB ANALYTICS & DATA MINING A.A. 2017/2018 Take your time with a PRO in Comms @LUMSA Rome, 15 december 2017 Francesco Malmignati Chief Technical Officer

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Center for Information and Language Processing, University of Munich 2009.07.14 1/36 Outline 1 Recap

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

DSCI 575: Advanced Machine Learning. PageRank Winter 2018 DSCI 575: Advanced Machine Learning PageRank Winter 2018 http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf Web Search before Google Unsupervised Graph-Based Ranking We want to rank importance based on

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

DATA MINING II - 1DL460. Spring 2017

DATA MINING II - 1DL460. Spring 2017 DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press. Link Analysis SEEM5680 Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press. 1 The Web as a Directed Graph Page A Anchor hyperlink Page

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 12: Crawling and Link Analysis 2 1 Ch. 11-12 Last Time Chapter 11 1. ProbabilisCc Approach to Retrieval / Basic Probability Theory

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

Chapter 2: Literature Review

Chapter 2: Literature Review Chapter 2: Literature Review 2.1 Introduction Literature review provides knowledge, understanding and familiarity of the research field undertaken. It is a critical study of related reviews from various

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

SE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work?

SE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work? PLAN SE Workshop Ellen Wilson Olena Zubaryeva Search Engines: How do they work? Search Engine Optimization (SEO) optimize your website How to search? Tricks Practice What is a Search Engine? A page on

More information

This tutorial has been prepared for beginners to help them understand the simple but effective SEO characteristics.

This tutorial has been prepared for beginners to help them understand the simple but effective SEO characteristics. About the Tutorial Search Engine Optimization (SEO) is the activity of optimizing web pages or whole sites in order to make them search engine friendly, thus getting higher positions in search results.

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Information Retrieval. Lecture 4: Search engines and linkage algorithms Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information