Bibliometrics: Citation Analysis

Similar documents
5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

COMP 4601 Hubs and Authorities

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Authoritative Sources in a Hyperlinked Environment

Recent Researches on Web Page Ranking

PageRank and related algorithms

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Information Retrieval and Web Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Information Retrieval and Web Search Engines

Searching the Web [Arasu 01]

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

CSI 445/660 Part 10 (Link Analysis and Web Search)

Bruno Martins. 1 st Semester 2012/2013

Mining Web Data. Lijun Zhang

Lecture 17 November 7

Information Networks: Hubs and Authorities

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Extracting Information from Complex Networks

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Information Retrieval: Retrieval Models

Information Retrieval. Lecture 11 - Link analysis

A ew Algorithm for Community Identification in Linked Data

Social Network Analysis

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Link Analysis and Web Search

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

This lecture. Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis.

TODAY S LECTURE HYPERTEXT AND

COMP5331: Knowledge Discovery and Data Mining

Temporal Graphs KRISHNAN PANAMALAI MURALI

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

The application of Randomized HITS algorithm in the fund trading network

Information Retrieval. hussein suleman uct cs

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Personalized Information Retrieval

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Authoritative Sources in a Hyperlinked Environment

Abstract. 1. Introduction

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

60-538: Information Retrieval

Degree Distribution: The case of Citation Networks

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Searching the Web What is this Page Known for? Luis De Alba

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

CS 6320 Natural Language Processing

Link Analysis. Hongning Wang

Popularity Weighted Ranking for Academic Digital Libraries

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Social Network Analysis

Lec 8: Adaptive Information Retrieval 2

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Introduction to Data Mining

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Mining Web Data. Lijun Zhang

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Scopus. Information literacy in Chemistry. J une 14, 2011

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

On Finding Power Method in Spreading Activation Search

Part 1: Link Analysis & Page Rank

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Clustering. Bruno Martins. 1 st Semester 2012/2013

Link Analysis in Web Mining

Approaches to Mining the Web

Social Network Analysis

Link Analysis in Web Information Retrieval

Topic Exploration and Distillation for Web Search by a Generalized Similarity Analysis

Lecture 8: Linkage algorithms and web search

Information Networks: PageRank

Introduction to Information Retrieval

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Chapter 6: Information Retrieval and Web Search. An introduction

How to Use Widener

Information retrieval

Information Retrieval. (M&S Ch 15)

Lecture 8 May 7, Prabhakar Raghavan

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

The IAC s Publications Archive. Monique Gómez & Jorge A. Pérez Prieto Instituto de Astrofísica de Canarias Tenerife, Spain

Spectral filtering for resource discovery

2- Access ScienceDirect?

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

CS 6604: Data Mining Large Networks and Time-Series

Finding Authoritative Web Pages Using the Web's Link Structure (David Wagner)

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Transcription:

Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Now, if you consider citations as links, academic papers can be viewed as a creating a graph. The structure of this graph, independent of content, can provide interesting information about the similarity of documents and the structure of information. citeseer.ist/psu.edu, scholar.google.com 1

Impact Factor Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. Measure of how often papers in the journal are cited by other scientists. Computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y 1 or Y 2. Does not account for the quality of the citing article. 2

Bibliographic Coupling Measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B. Size of the intersection of their bibliographies. Maybe want to normalize by size of bibliographies? A B 3

Co-Citation An alternate citation-based measure of similarity introduced by Small in 1973. Number of documents that cite both A and B. Maybe want to normalize by total number of documents citing either A or B? A B 4

Citations vs. Links Web links are a bit different than citations: Many links are navigational. Many pages with high in-degree are portals not content providers. Not all links are endorsements. Company websites don t point to their competitors. Citations to relevant literature is enforced by peer-review. 5

Hubs & Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority. However in-degree treats all links as equal. Should links from pages that are themselves authoritative count more? Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). 6

HITS Algorithm developed by Kleinberg in 1998. Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: Hubs point to lots of authorities. Authorities are pointed to by lots of hubs. 7

HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. First determines a set of relevant pages for the query called the base set S. Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set. 8

Constructing a Base Subgraph For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R. S R 9

Base Limitations To limit computational expense: Limit number of root pages to the top 200 pages retrieved for the query. Limit number of back-pointer pages to a random set of at most 50 pages returned by a reverse link query. To eliminate purely navigational links: Eliminate links between two pages on the same host. To eliminate non-authority-conveying links: Allow only m (m = 4-8) pages from a given host as pointers to any individual page. 10

Authorities and In-Degree, Assumptions Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (may just be generally popular pages like Yahoo or Amazon). True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities). 11

HITS Iterative Algorithm Initialize for all p in S: a p = h p = 1 For i = 1 to k: For all p in S: scores) a p = q: q p h p = q: p q h q a q For all p in S: For all p in S: a p = a p /c c: For all p in S: h p = h p /c c: (update auth. (update hub scores) p S p S a p /c 2 =1 h p /c 2 =1 (normalize a) (normalize h) 12

Convergence Algorithm converges to a fix-point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S. A ij = 1 if there is a link from node i to node j, else 0 Authority vector, a, converges to the principal eigenvector of A T A Hub vector, h, converges to the principal eigenvector of AA T In practice, 20 iterations produces fairly stable results. 13

Results Authorities for query: Java java.sun.com comp.lang.java FAQ Authorities for query search engine Yahoo.com Excite.com Lycos.com Altavista.com Authorities for query Gates Microsoft.com roadahead.com 14

Result Comments In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score. 15

Finding Similar Pages Using Link Structure Given a page, P, let R (the root set) be t (e.g. 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar-pages for P. Finds authorities in the link neighborhood of P. 16

Given honda.com toyota.com ford.com bmwusa.com saturncars.com nissanmotors.com audi.com volvocars.com Similar Page Results 17

HITS for Clustering An ambiguous query can result in the principal eigenvector only covering one of the possible meanings. Non-principal eigenvectors may contain hubs & authorities for other meanings. Example: jaguar : Atari video game (principal eigenvector) NFL Football team (2 nd non-princ. eigenvector) Automobile (3 rd non-princ. eigenvector) 18

HITS improvements weight links according to similarity of hyperlinked text Weight documents on textual similarity Only count links from different hosts 19

HITS analysis On finding high quality pages Precision at 5 10 In Degree 71 57 Authority 69 57 PageRank 69 53 # pages on site 66 56 # of images on page 62 56 # audio on page 52 39 textual relevance 44 53 from does authority mean quality 20

Latent Semantic Indexing Initially developed by Deerwester, Dumais et al Goal to address issues with word-based document retrieval General idea: replace word-based vectorspace model with something else Presumably something better What is that something? 21

22

23

Word Occurrence Matrix 24

25

26

SVD of previous example 27

28

Note the can But the semantics may not be comprehensible Results are unimpressive The doc to LSI can be dome offline. However, need to do query to LSI then doc*query and cannot use inverted index to cut this computation 29