# INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

## Transcription

1 INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5)

2 Introduction. Motivation: accurate web search. Spammers want you to land on their pages. Topics: Google's PageRank and its variants, TrustRank, and Hubs and Authorities (HITS). Link analysis: analyze the structure of a very large graph (the Web).

3 PageRank

4 Early SE and Term Spam. Early search engines invented term search: crawl the Web; extract terms (e.g. words) from each page; create an inverted index (which terms appear in which pages). Query processing: find all pages with the query terms; rank pages according to importance/relevance, e.g. a term in the title of a page is more important. Spammers invented term spam: add fake terms (in an invisible font); run a popular query, see which page comes first, and copy it.
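
As a rough illustration of the inverted index mentioned on this slide, the sketch below maps each term to the set of pages that contain it and answers a query by intersecting those sets. The page identifiers, page texts, and function names are made up for the example, and tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-page collection.
pages = {
    "p1": "jaguar animal habitat",
    "p2": "jaguar car dealer",
    "p3": "car insurance",
}
index = build_inverted_index(pages)
```

Because the index only records which terms occur where, a page stuffed with invisible fake terms matches queries it has nothing to do with, which is exactly what term spam exploits.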

5 Google Innovation. PageRank: simulate a random surfer who starts from a random page and follows random out-links; important pages have a large chance of being on the simulated random path. Both page importance and terms are used for ranking. Terms around the link: the relevance of a page is judged by the terms within the page and by the terms around links to this page.

6 Definition of PageRank. A function that assigns a real number to each page; more important pages get a higher PageRank. The Web is modeled as a directed graph (nodes are pages, edges are links).

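7 Transition Matrix. Entry $m_{ij}$ is the probability of jumping from node $j$ to node $i$. Assume equal probability for each out-link (a node with $k$ out-links gives probability $1/k$ to each). PageRank is a column vector whose $i$-th component is the probability of being at node $i$.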
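
A minimal sketch of building such a matrix from adjacency lists is given here. It assumes the column convention needed for $x = Mv$, i.e. `M[i][j]` is the probability of moving from page `j` to page `i`; the four-page graph and the function name are hypothetical.

```python
def transition_matrix(out_links):
    """Build a column-stochastic matrix from {page: [pages it links to]}.

    M[i][j] is the probability of moving from page j to page i.
    """
    pages = sorted(out_links)
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for src, targets in out_links.items():
        for dst in targets:
            # Each of the k out-links of src is followed with probability 1/k.
            M[index[dst]][index[src]] = 1.0 / len(targets)
    return pages, M


# Hypothetical four-page graph with no dead ends.
graph = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
pages, M = transition_matrix(graph)
```

With no dead ends, every column of `M` sums to 1, which is the stochastic property used on the following slides.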

8 Stable Distribution. Assume the initial probability of being at each node is the vector $v_0 = (1/n, 1/n, \dots, 1/n)$, and let $M$ be the transition matrix. What is the probability after a single step? $x = M v_0$, i.e. $x_i = \sum_j m_{ij} (v_0)_j$. After $k$ steps: $x_k = M^k v_0 = M M \cdots M v_0$.

9 Markov Process. The distribution over nodes at step $k$ depends only on the distribution at step $k-1$. A limiting distribution $v = Mv$ exists provided that the graph is strongly connected (it is possible to get from any node to any node) and there are no dead ends (nodes that have no arcs out). The limiting distribution is an eigenvector of $M$.

10 Principal Eigenvector. The transition matrix $M$ is stochastic (each column adds up to 1). The limiting distribution is the principal eigenvector (associated with the largest eigenvalue, which is 1 for a stochastic matrix): $v = \lambda M v$. Computation: iterate by multiplying by the matrix $M$ until there is no significant change (a limited number of iterations suffices for the Web).
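
The iteration can be sketched in a few lines of plain Python. The matrix below is an assumed example for a four-page graph (A links to B, C, D; B to A, D; C to A; D to B, C), written column-stochastically; the tolerance and iteration cap are arbitrary choices.

```python
def power_iterate(M, tol=1e-9, max_iter=200):
    """Repeat v <- M v from the uniform vector until the L1 change is below tol."""
    n = len(M)
    v = [1.0 / n] * n
    for step in range(1, max_iter + 1):
        nxt = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        if change < tol:
            break
    return v, step


# Assumed column-stochastic matrix; rows and columns are ordered A, B, C, D.
M = [
    [0.0, 0.5, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]
v, steps = power_iterate(M)
```

For this graph the vector settles at (3/9, 2/9, 2/9, 2/9), so page A ends up as the most important page.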

11 Example Assuming transition matrix Successive multiplications

12 Structure of the Web. In practice, the Web is not a strongly connected graph.

13 Structure of the Web. A large strongly connected component (SCC). In-component: pages that can reach the SCC but are not reachable from the SCC. Out-component: pages reachable from the SCC but unable to reach the SCC. Two types of tendrils: those leading out of the in-component and those leading into the out-component. Tubes: from the in-component to the out-component. Isolated components.

14 Two General Problems. Dead ends: pages with no links out. Spider traps: groups of pages with no links to pages outside the group, although each page has out-links within the group.

15 Avoiding Dead Ends. With a dead end the transition matrix is not stochastic (it has an all-zero column); it is substochastic (column sums are at most 1). Increasing powers of $M$ drive some or all elements of $v$ to zero. Example

16 Dropping Dead Ends. Drop dead ends and their incoming arcs from the graph. Other nodes may then become dead ends, so drop recursively to obtain a strongly connected component. Compute PageRank on the remaining graph. Restore the graph by adding nodes back in the reverse order of removal. Computing PageRank for restored nodes: each parent with PageRank $p$ and $k$ out-links contributes $p/k$ to the restored node.

17 Example. Drop dead ends; compute PageRank on the reduced graph; restore C, then restore E (single parent, so the same PageRank). The result is not a distribution (it does not sum to 1).
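
The drop-and-restore procedure can be sketched as follows. The five-page graph is an assumption chosen so that E is a dead end and C becomes one after E is dropped; the PageRank values of the reduced graph are supplied by hand (they satisfy $v = Mv$ on that graph) rather than computed, and the function names are illustrative.

```python
from fractions import Fraction


def drop_dead_ends(out_links):
    """Recursively remove nodes without out-links; return (reduced graph, removal order)."""
    graph = {page: set(targets) for page, targets in out_links.items()}
    removed = []
    while True:
        dead = sorted(page for page, targets in graph.items() if not targets)
        if not dead:
            return graph, removed
        removed.extend(dead)
        for page in dead:
            del graph[page]
        for targets in graph.values():
            targets.difference_update(dead)


def restore(out_links, rank, removed):
    """Add removed nodes back in reverse order; parent p contributes rank[p] / k_p."""
    rank = dict(rank)
    for node in reversed(removed):
        rank[node] = sum(
            rank[parent] / len(targets)
            for parent, targets in out_links.items()
            if node in targets
        )
    return rank


# Assumed graph: E is a dead end, and C becomes one once E is removed.
graph = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["E"], "D": ["B", "C"], "E": []}
reduced, removed = drop_dead_ends(graph)

# PageRank of the reduced graph (A, B, D), supplied by hand for this sketch.
reduced_rank = {"A": Fraction(2, 9), "B": Fraction(4, 9), "D": Fraction(3, 9)}
full_rank = restore(graph, reduced_rank, removed)
```

Here C receives 2/9 ÷ 3 from A and 3/9 ÷ 2 from D, i.e. 13/54, and E inherits the same value from its single parent C. The restored values sum to more than 1, which is why the result is no longer a distribution.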

18 Spider Traps and Taxation Example

19 Teleporting. A random surfer has a small probability of jumping from any page to any page. $e$ is a vector of all 1's, and the teleport probability $1 - \beta$ is small (about 0.15). For dead ends there is always a probability of getting out.

20 Example Assume β = 0.8
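
A small sketch of the taxed iteration, assuming the usual form $v' = \beta M v + (1 - \beta) e / n$ with $\beta = 0.8$. The graph is a hypothetical four-page example in which C links only to itself (a one-page spider trap); the function name and the fixed iteration count are arbitrary.

```python
def pagerank(out_links, beta=0.8, iterations=200):
    """Iterate v' = beta * M v + (1 - beta) * e / n, starting from the uniform vector."""
    pages = sorted(out_links)
    n = len(pages)
    v = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page first receives its share of the teleporting surfers.
        nxt = {page: (1.0 - beta) / n for page in pages}
        for src, targets in out_links.items():
            for dst in targets:
                # The remaining surfers follow one of the k out-links of src.
                nxt[dst] += beta * v[src] / len(targets)
        v = nxt
    return v


# Hypothetical graph: C is a spider trap that links only to itself.
graph = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}
rank = pagerank(graph, beta=0.8)
```

Without teleporting, C would absorb all of the PageRank; with $\beta = 0.8$ it still dominates (95/148), but A, B, and D keep 15/148, 19/148, and 19/148 respectively.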

21 Using PageRank in a SE. A secret formula ranks pages in response to a query, combining term relevance, PageRank, and 250 other properties of pages (Google).

22 Efficient Computation of PageRank

23 PageRank for a Large Graph. 50 iterations of matrix-vector multiplication, using the MapReduce method. The transition matrix $M$ is very sparse, so represent only the non-zero elements. Modify the MapReduce striping approach to reduce the amount of data passed from Map tasks to Reduce tasks.

24 Representing Transition Matrices. With 10B pages and 10 links per page, about 1 in every 1B entries is non-zero. Listing entries as (row, column, value) takes 4 bytes per coordinate index and 8 bytes for the value: 16 bytes in total per non-zero entry. Alternatively, list the non-zero entries by column: a single integer for the number of non-zeros in the column, plus 4 bytes for the row number of each non-zero entry.
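
The arithmetic behind these figures can be checked with a short script. It takes the slide's sizes at face value (4-byte integers, 8-byte values) and assumes that, in the by-column layout, the value of every entry in a column is implied by the column's count as 1/count. The helper `to_column_lists` and the integer page ids are illustrative only.

```python
pages = 10 * 10**9                 # 10B pages
links_per_page = 10
nonzeros = pages * links_per_page  # non-zero entries of M
density = nonzeros / pages**2      # fraction of entries that are non-zero

# (row, column, value) triples: 4 + 4 + 8 bytes per non-zero entry.
triple_bytes = nonzeros * (4 + 4 + 8)

# By column: one 4-byte count per column plus one 4-byte row number per entry.
by_column_bytes = pages * 4 + nonzeros * 4


def to_column_lists(out_links):
    """{source: [destinations]} -> {source: (out_degree, sorted row numbers)}."""
    return {src: (len(dsts), sorted(dsts)) for src, dsts in out_links.items()}


# Hypothetical four-page graph with integer page ids.
columns = to_column_lists({0: [1, 2, 3], 1: [0, 3], 2: [0], 3: [1, 2]})
```

Under these assumptions the triple format needs about 1.6 TB and the by-column format about 0.44 TB.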

25 Example Transition Matrix Representation

26 PageRank Iteration Using MapReduce. For small $n$, store the vector $v$ in the main memory of each node. Map: $(i, j, m_{ij}) \to (i, m_{ij} v_j)$. Reduce: $(i, [m_{i1} v_1, \dots, m_{in} v_n]) \to (i, \sum_j m_{ij} v_j)$. For large $n$: break $M$ into vertical stripes and $v$ into horizontal stripes, or break $M$ into blocks and $v$ into stripes.
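
The small-$n$ case can be imitated on a single machine to show what the two functions compute. This is only a sketch of the data flow, not a distributed implementation: `map_phase`, `reduce_phase`, and the sample entries are hypothetical, and the shuffle step is replaced by a dictionary keyed on the row index.

```python
from collections import defaultdict


def map_phase(entries, v):
    """For every non-zero entry (i, j, m_ij), emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]


def reduce_phase(pairs):
    """Sum all values that share the key i, giving x_i = sum_j m_ij * v_j."""
    sums = defaultdict(float)
    for i, term in pairs:
        sums[i] += term
    return dict(sums)


# Non-zero entries of an assumed 4x4 column-stochastic matrix.
entries = [
    (0, 1, 0.5), (0, 2, 1.0),
    (1, 0, 1 / 3), (1, 3, 0.5),
    (2, 0, 1 / 3), (2, 3, 0.5),
    (3, 0, 1 / 3), (3, 1, 0.5),
]
v = [0.25, 0.25, 0.25, 0.25]
x = reduce_phase(map_phase(entries, v))
```

One pass turns the uniform vector into (9/24, 5/24, 5/24, 5/24); repeating the pass with `x` in place of `v` gives the next PageRank iteration.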

27 Topic-Sensitive PageRank

28 Motivation. A search for "jaguar" may mean the animal, the automobile, the Mac OS version, or an ancient game console. If the SE can guess the topic, it can return more relevant results. Select a small number of topics, create a PageRank vector for each topic (e.g. 16 DMOZ categories), and detect the user's interest with respect to one of these topics.

29 Biased Random Walk. Assume random surfers start only from a random sports page: the teleport set $S$ is the set of sports pages. Usage: decide on the topics; select a teleport set for each topic; find a way to decide which topic(s) are relevant to a query; use the appropriate PageRank vector.
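
A sketch of the biased walk, assuming the update $v' = \beta M v + (1 - \beta) e_S / |S|$, where $e_S$ has 1's only for pages in the teleport set. The graph, the choice $S = \{B, D\}$, and $\beta = 0.8$ are assumptions made for the example.

```python
def topic_sensitive_pagerank(out_links, teleport_set, beta=0.8, iterations=200):
    """PageRank in which teleporting surfers land only on pages in teleport_set."""
    pages = sorted(out_links)
    share = 1.0 / len(teleport_set)
    v = {page: (share if page in teleport_set else 0.0) for page in pages}
    for _ in range(iterations):
        nxt = {page: 0.0 for page in pages}
        for page in teleport_set:
            # Teleporting surfers are spread over the teleport set only.
            nxt[page] += (1.0 - beta) * share
        for src, targets in out_links.items():
            for dst in targets:
                nxt[dst] += beta * v[src] / len(targets)
        v = nxt
    return v


# Hypothetical graph; B and D play the role of the topic's pages.
graph = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
rank = topic_sensitive_pagerank(graph, teleport_set={"B", "D"}, beta=0.8)
```

The pages in the teleport set, B and D, end up with the highest score (59/210 each), ahead of A (54/210) and C (38/210), even though A is the most important page under the unbiased walk on this graph.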

31 Architecture of a Spam Farm. Spammers constantly try to improve the PageRank of their pages. The Web from the point of view of a spammer: inaccessible pages (e.g. Amazon), accessible pages (e.g. blogs), and own pages (spam).

32 Spam Farm. A single target page and $m$ supporting pages.

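33 Analysis of a Spam Farm. $x$: the PageRank contributed by accessible pages, $x = \sum_i \beta p_i / k_i$, where $p_i$ is the PageRank and $k_i$ the number of out-links of accessible page $i$. $y$: the unknown PageRank of the target page. The PageRank of each supporting page is $\beta y / m + (1 - \beta)/n$.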

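34 PageRank of Target Page. Contribution $x$ from outside. Contribution of every supporting page: $\beta (\beta y / m + (1 - \beta)/n)$. Contribution from teleported surfers: $(1 - \beta)/n$ (ignore). Total: $y = x + \beta m (\beta y / m + (1 - \beta)/n)$. Solve: $y = x / (1 - \beta^2) + c\, m/n$, where $c = \beta / (1 + \beta)$.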

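35 Example. Assume $\beta = 0.85$; then $1/(1 - \beta^2) \approx 3.6$ and $c = \beta/(1 + \beta) \approx 0.46$, so $y = 3.6\,x + 0.46\, m/n$. The farm amplifies $x$, the contribution of outside pages, by 360%, and adds 46% of $m/n$, the fraction of the Web that is in the spam farm.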
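
The two constants can be checked numerically. The sketch assumes the closed form $y = x/(1 - \beta^2) + c\,m/n$ with $c = \beta/(1 + \beta)$, which follows from $y = x + \beta m(\beta y/m + (1 - \beta)/n)$; the function names and the sample values of $x$, $m$, and $n$ are made up.

```python
def spam_farm_constants(beta):
    """Return (amplification of x, coefficient c of m/n) for taxation parameter beta."""
    amplification = 1.0 / (1.0 - beta**2)
    c = beta / (1.0 + beta)
    return amplification, c


def target_pagerank(x, m, n, beta=0.85):
    """PageRank y of the target page for outside contribution x and m supporting pages."""
    amplification, c = spam_farm_constants(beta)
    return amplification * x + c * m / n


amplification, c = spam_farm_constants(0.85)

# Hypothetical farm: 1,000 supporting pages in a Web of 1,000,000 pages.
y = target_pagerank(x=0.001, m=1000, n=1_000_000, beta=0.85)
```

With $\beta = 0.85$ the constants round to 3.6 and 0.46. Note that 3.6 is what $\beta = 0.85$ gives; $\beta = 0.86$ would give about 3.8.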

36 Combating Link Spam. A battle between SEs, which try to detect spam-farm-like structures, and spammers, who invent new ones. TrustRank: a variation of topic-sensitive PageRank designed to lower the score of spam pages. Spam mass: identifies pages that are likely to be spam.

37 TrustRank. Let the teleport set $S$ be a set of pages that are considered trustworthy, i.e. spammers can't inject spam links into them (e.g. no talkbacks). Selecting trustworthy pages: human-selected pages, or pages from specific domains (.edu, .mil, .gov).

38 Spam Mass. Measures the fraction of a page's PageRank that comes from spam. Compute the PageRank $r$ and the TrustRank $t$; the spam mass is $(r - t)/r$. Not spam: negative or small positive. Spam: close to one ($t$ is almost zero).

39 Example. Trustworthy pages: B and D. No spam pages.
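
A sketch of the spam-mass calculation for a case like this one. The input vectors are assumptions: `r` is the plain PageRank and `t` the TrustRank (teleport set {B, D}, $\beta = 0.8$) of a hypothetical four-page graph in which A links to B, C, D; B to A, D; C to A; and D to B, C.

```python
from fractions import Fraction


def spam_mass(r, t):
    """Spam mass (r - t) / r of every page, given PageRank r and TrustRank t."""
    return {page: (r[page] - t[page]) / r[page] for page in r}


# Assumed PageRank and TrustRank vectors for the four-page example.
r = {"A": Fraction(3, 9), "B": Fraction(2, 9), "C": Fraction(2, 9), "D": Fraction(2, 9)}
t = {
    "A": Fraction(54, 210),
    "B": Fraction(59, 210),
    "C": Fraction(38, 210),
    "D": Fraction(59, 210),
}
mass = spam_mass(r, t)
```

The trusted pages B and D get a negative spam mass (about -0.26), while A and C get small positive values (about 0.23 and 0.19). None is close to one, which matches the conclusion that there are no spam pages.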

40 Hubs and Authorities

41 HITS. Hyperlink-induced topic search (HITS); probably used by the Ask.com SE. Originally intended to help rank query results, not as a pre-processing step like PageRank. Here we apply it to the entire Web.

42 The Intuition Behind HITS. Authorities: certain pages are valuable because they provide information about a topic. Hubs: other pages are valuable because they point to good pages about that topic. Example: the homepage of a faculty is a hub; the homepage of each course is an authority. Recursive definition: a page is a good hub if it links to good authorities, and a good authority if it is linked to by good hubs.

43 Formalizing Hubbiness and Authority. Link matrix of the Web $L$: $L_{ij} = 1$ if there is a link from $i$ to $j$, and 0 otherwise. Transpose $L^T$: $L^T_{ij} = 1$ if there is a link from $j$ to $i$. $L^T$ is similar to the transition matrix $M$ ($M$ has probabilities where $L^T$ has 1's).

44 Scores. Let $h$ and $a$ be the score vectors for hubbiness and authority respectively; scale each vector to sum to 1. Computation: $h = \lambda L a$ and $a = \mu L^T h$, with scaling constants $\lambda$ and $\mu$. Substituting: $h = \lambda \mu L L^T h$ and $a = \lambda \mu L^T L a$.

45 Computing. $L^T L$ is much less sparse than $L$, so it is better to compute $h$ and $a$ by a true mutual recursion. Algorithm: compute $a = \mu L^T h$ and scale; compute $h = \lambda L a$ and scale; repeat until the changes are small.
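
The mutual recursion can be sketched without forming $L^T L$ at all. The link matrix below describes a hypothetical four-page graph (page 0 links to pages 2 and 3, page 1 links to page 2); vectors are scaled to sum to 1 as on the previous slide, and a fixed iteration count stands in for the "changes are small" test.

```python
def scale(vec):
    """Scale a vector so that its components sum to 1 (left unchanged if all zero)."""
    total = sum(vec)
    return [x / total for x in vec] if total else vec


def hits(links, iterations=50):
    """links[i][j] == 1 if page i links to page j. Returns (hub, authority) scores."""
    n = len(links)
    h = [1.0 / n] * n
    a = [0.0] * n
    for _ in range(iterations):
        # a = L^T h: a page's authority is the sum of the hub scores pointing at it.
        a = scale([sum(links[i][j] * h[i] for i in range(n)) for j in range(n)])
        # h = L a: a page's hubbiness is the sum of the authority scores it points to.
        h = scale([sum(links[i][j] * a[j] for j in range(n)) for i in range(n)])
    return h, a


# Hypothetical link matrix: 0 -> 2, 0 -> 3, 1 -> 2.
links = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
h, a = hits(links)
```

Page 0 comes out as the better hub and page 2 as the better authority; pages 2 and 3 have no out-links, so their hub scores stay at zero.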

46 Summary

47 Summary. Term spam: inject terms and copy pages. PageRank and the transition matrix: page importance defined by a random surfer. Dead ends and spider traps: taxation/teleporting and removal of dead ends. Combating spam farms: TrustRank and spam mass. Topic-sensitive PageRank: teleport sets. Hubs and authorities: mutually recursive definition.
