Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Similar documents
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Link Analysis and Web Search

Bruno Martins. 1 st Semester 2012/2013

CS6200 Information Retreival. The WebGraph. July 13, 2015

Information Retrieval. Lecture 11 - Link analysis

Searching the Web [Arasu 01]

Information Retrieval and Web Search Engines

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

COMP5331: Knowledge Discovery and Data Mining

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Social Network Analysis

COMP 4601 Hubs and Authorities

Web Structure Mining using Link Analysis Algorithms

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Bibliometrics: Citation Analysis

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Information Retrieval and Web Search

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

DATA MINING - 1DL105, 1DL111

Authoritative Sources in a Hyperlinked Environment

Lecture 17 November 7

Mining Web Data. Lijun Zhang

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

Information Retrieval. Lecture 4: Search engines and linkage algorithms

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Searching the Web What is this Page Known for? Luis De Alba

PageRank and related algorithms

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

An Improved Computation of the PageRank Algorithm 1

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval

Link Analysis in Web Information Retrieval

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Mining Web Data. Lijun Zhang

DATA MINING II - 1DL460. Spring 2014"

Recent Researches on Web Page Ranking

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Part 1: Link Analysis & Page Rank

CS 347 Parallel and Distributed Data Processing

Big Data Analytics CSCI 4030

CS 347 Parallel and Distributed Data Processing

Extracting Information from Complex Networks

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Information Networks: PageRank

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Information Retrieval Spring Web retrieval

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Link Analysis. Hongning Wang

Big Data Analytics CSCI 4030

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

TODAY S LECTURE HYPERTEXT AND

The PageRank Citation Ranking: Bringing Order to the Web

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Introduction to Data Mining

An Adaptive Approach in Web Search Algorithm

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea

Degree Distribution: The case of Citation Networks

Part I: Data Mining Foundations

Administrative. Web crawlers. Web Crawlers and Link Analysis!

On Finding Power Method in Spreading Activation Search

Information Retrieval May 15. Web retrieval

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

CSI 445/660 Part 10 (Link Analysis and Web Search)

DATA MINING II - 1DL460. Spring 2017

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

The application of Randomized HITS algorithm in the fund trading network

Deep Web Crawling and Mining for Building Advanced Search Application

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Abstract. 1. Introduction

Special-topic lecture bioinformatics: Mathematics of Biological Networks

Collaborative filtering based on a random walk model on a graph

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Dynamic Visualization of Hubs and Authorities during Web Search

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

Networks in economics and finance. Lecture 1 - Measuring networks

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Title: Artificial Intelligence: an illustration of one approach.

Lecture 8: Linkage algorithms and web search

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

CS 6604: Data Mining Large Networks and Time-Series

Exam IST 441 Spring 2013

Word Disambiguation in Web Search

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

CS425: Algorithms for Web Scale Data

Personalized Information Retrieval

HIPRank: Ranking Nodes by Influence Propagation based on authority and hub

Slides based on those in:

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Transcription:

Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1

WWW retrieval Two approaches 1. Document collection (a) Collect documents (b) Match terms 2. Citation graphs and social network (a) Collect documents (b) Apply graph analysis (c) Determine page rank by prestige indexes April 10, 2003 Page 2

Building a WWW collection: Crawling WWW 1. Distributed collection (a) Lack of centralized archive i. No central list of URLs ii. No or little metadata (b) Lack of homogeneous markup i. Multimedia 2. Crawling ii. Parsing (a) Application walks network (b) Collect documents (c) Parse hyperlinks (d) Traverse hyperlinks: collect new pages April 10, 2003 Page 3

Crawling Issues Scale 1. Rapid expansion of collection 2. Network delays (a) Resolving hostname (b) Server request (c) Checking URL cache 3. Parsing and storing page Validity 1. WWW is still growing (a) Adequate sample? (b) Scalability issues 2. Spider traps 3. Duplication 4. Freshness April 10, 2003 Page 4

Basic Crawler April 10, 2003 Page 5

Perl LWP module dean - johan/src April 10, 2003 Page 6

Power Crawler April 10, 2003 Page 7

Distributed Crawler Architectures April 10, 2003 Page 8

Focused crawling 1. Subject matters (a) Normal crawlers: i. Recall: harvest all ii. Technically not feasible iii. Large overhead for many purposes (b) Restrict scope of crawl to specific subject 2. Approaches (a) Subject classification (b) Page importance, backlink counts (c) Depth-First, Breadth-First, Best-First (d) Use user information: bookmarks (e) Link topology See Mercator: http://research.compaq.com/src/ mercator/research.html April 10, 2003 Page 9

Citation analysis and social networks 1. WWW retrieval problems (a) Size: i. Low recall rates ii. Precision? (b) Heterogeneous: i. jargon ii. junk iii. ranking means little iv. conten matching=bad idea 2. approaches based on social network analysis (a) like humans, some pages are more equal than others (b) find alpha-male of web pages through social structure (c) social network analysis on graph structure of web April 10, 2003 Page 10

Social networks April 10, 2003 Page 11

April 10, 2003 Page 12

Social Networks and bibliometrics Traditional 1. undirected or directed graph 2. edges represent social relationships (a) frequency of conversation (b) rating of friendship (c) phonecalls (d) email (e) co-authorship 3. structural representation of human relationships (a) status (b) Positions Extended: Bibliometrics 1. Any network generated by humans 2. Citation network 3. WWW hyperlinks April 10, 2003 Page 13

Prestige and Power in social networks Prestige and Power 1. Each actor is characterized by structure of relationships 2. High v. low number of relationships (a) high: high prestige (b) low: low prestige 3. metrics of centrality Centrality 1. range of metrics 2. Degree centrality 3. Closeness centrality 4. Betweenness centrality 5. Bonacich centrality April 10, 2003 Page 14

Social networks and centrality directed graph G = (V, E) set of n actors denoted V = {v 0,v 1,...,v n } E represents the edges between pairs of journals; E V 2. Degree centrality c i = n j=0 w i j + n j=0 w ji n j=0 n k=0 w k j (1) Closeness centrality d(v i,v j ) is geodesic distance between v i and v j c i = n 1 i=0 d(v i,v j ) n (2) Betweenness centrality g i jk number of geodesic paths between any pair of nodes (v i,v j ) that passes through node v k b k = g i jk g i j (3) April 10, 2003 Page 15

Bonacich centrality Recursive definition Prestige of actor is function of prestige of actors that related to this actor. Assume vector p, entries represent p(v) represent prestige of node v Adjacency matrix is E so that E(i, j) = 1 only when (v i,v j V 2. Given a certain initial p: p = E T p To determine fixpoint for prestige vector, initialize p = (1,1,,1) and iteratively p E T p (normalizing values!) Convergent solution is principal eigenvector. Eigenvector centrality A little like paths are traversed starting from every possible position. Nodes that are visited most often: highest prestige. April 10, 2003 Page 16

Bibliometrics Impact Factor Citation graph over journals. Impact Factor for journal i in year x: IF(i,x) F x (v i ) is frequency by which articles in journal v i are cited in 2 years before x. n number of articles published in journal i over two year period before x IF(i,x) = F x(v i ) n 1. Impact determination of journals Uses 2. Evaluation of publications, research terms, etc. 3. Scientometrics: determining structure of scientific efforts April 10, 2003 Page 17

Applications to WWW WWW is directed graph: 1. Determine prestige of pages 2. Rank retrieval results in search engine according to page prestige 3. Only most authorative pages presented to user HITS 1. Query isolates subset 2. adjancency matrix is determined 3. Authorities and Hubs PageRank 1. Each page on WWW has prestige value 2. Query selects subset of collection 3. Ranked according to query April 10, 2003 Page 18

HITS 1. developed by Jon Kleinberg (Cornell, IBM Almaden) 2. Basic idea: (a) Prestige or influence has two flavors i. Authorities: high quality information ii. Hubs: comprehensive list of links to authorities (b) Recursive definitions i. Relations to Bonacich or eigenvector centrality ii. Every page has two degrees of hubness and authoritativeness April 10, 2003 Page 19

Hubs and Authorities April 10, 2003 Page 20

Authority and Hub scores Authority : sum of hub score of pages linking to it Hub : sum of authority score of pages to which it links Hub scores over all pages in collection or subset corresponding to query: h Authority scores over all pages in collection or subset corresponding to query: a vech = E a (4) veca = E t h (5) Conversion of vech and veca results in set of authority and hubness values for pages April 10, 2003 Page 21

HITS procedure Query first approach 1. Query: text based IR system 2. Expand root set by radius one 3. power iterations on hubs and authorities 4. Report top ranking hubs and authorities Topic Distillations 1. No pre-computation of scores 2. As in real-life situations hubs are most useful 3. In WWW: hubs acquire authority: need for hub score? April 10, 2003 Page 22

PageRank 1. Introduced by Brin and Page, Stanford 2. Notion of Random Walk 3. Start user at random positions (a) probability of finding users at page X after n clicks (b) recursive procedure relating to infinite, random walk 4. principle of eigenvector centrality: p 1 = L t p 0 5. problem: (a) Web is not strongly connected (b) leaf nodes and cycles Principle of pre-calculation 1. Once WWW subgraph is know: each page estimated prestige 2. Query isolates documents, ranked according to prestige April 10, 2003 Page 23

PageRank: problems 1. Hyperlink design (a) Hyperlink authority or influence? (b) WWW: designed by humans? 2. Nepotism (a) Pages on same web site or conglomerate link to each other (b) Can warp pagerank 3. Clique attacks (a) Deliberate attempt to link clique of pages (b) Case of Search King v. Google April 10, 2003 Page 24

Clique attack April 10, 2003 Page 25

Webology discussion April 10, 2003 Page 26