Hierarchical Link Analysis for Ranking Web Data


Hierarchical Link Analysis for Ranking Web Data
Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker
Digital Enterprise Research Institute, Galway
June 1, 2010

Introduction: Web of Data
The number of web data sources is growing: Linked Open Data cloud; Open Graph protocol; e-commerce (GoodRelations), e-government, ...
How to search and retrieve relevant information?
A single query can return millions of entities, and users expect only the most relevant ones.
Web data search engines (e.g., Sindice) need an effective way to rank entities.
Partial solution: popularity-based entity ranking.
1 / 36

Link Analysis on the Web
Link Analysis: Given a directed graph, determine the popularity of its nodes using link information. A link from a node i to a node j is considered as evidence of the importance of node j.
Link Analysis for Web Documents: PageRank considers exclusively link structure. Hierarchical Link Analysis considers both link structure and hierarchical structure.
Link Analysis for Web Data: Current approaches consider exclusively link structure. Sindice: dataset/entity-centric view.
2 / 36


Outline: Web Data Model
Web Data Graph; Dataset Graph; Internal and External Node; Intra and Inter-Dataset Edge; Linkset; Two-Layer Model; Quantifying the Two-Layer Model
3 / 36

Web Data Graph Figure: Web data graph 4 / 36

Dataset Graph Figure: Dataset graph 5 / 36

Internal and External Node Figure: Internal (red) and external nodes (blue) 6 / 36

Intra and Inter-Dataset Edge Figure: Inter-dataset (orange) and intra-dataset (black) edges 7 / 36

Linkset Figure: Linkset 8 / 36

Two-Layer Model Figure: Two-layer model of the Web of Data 9 / 36

Quantifying the Two-Layer Model
Datasets:
  DBpedia: 17.7 million entities
  Citeseer (RKBExplorer): 2.48 million entities
  Geonames: 13.8 million entities
  Sindice: 60 million entities among 50,000 datasets
Dataset     Intra            Inter
DBpedia     88M (93.2%)      6.4M (6.8%)
Citeseer    12.9M (77.7%)    3.7M (22.3%)
Geonames    59M (98.3%)      1M (1.7%)
Sindice     287M (78.8%)     77M (21.2%)
Table: Ratio of intra-dataset to inter-dataset links
10 / 36
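The intra/inter split in the table can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions: the entity names, labels, and the `split_links` helper are hypothetical and not taken from the slides. It classifies each link as intra-dataset or inter-dataset and groups the inter-dataset links into linksets keyed by (label, source dataset, target dataset).

```python
from collections import Counter

# Hypothetical toy data: each entity mapped to the dataset that defines it.
entity_dataset = {
    "a1": "A", "a2": "A", "a3": "A",
    "b1": "B", "b2": "B",
    "c1": "C",
}

# Directed labelled links: (source entity, label, target entity).
links = [
    ("a1", "knows", "a2"),
    ("a2", "knows", "a3"),
    ("a1", "seeAlso", "b1"),
    ("a3", "seeAlso", "b2"),
    ("b1", "cites", "b2"),
    ("b2", "sameAs", "c1"),
]

def split_links(links, entity_dataset):
    """Return (intra-dataset links, linkset sizes for inter-dataset links)."""
    intra = []
    linksets = Counter()
    for src, label, dst in links:
        d_src, d_dst = entity_dataset[src], entity_dataset[dst]
        if d_src == d_dst:
            intra.append((src, label, dst))
        else:
            # A linkset groups links with the same label between two datasets.
            linksets[(label, d_src, d_dst)] += 1
    return intra, linksets

intra, linksets = split_links(links, entity_dataset)
inter_count = sum(linksets.values())
total = len(links)
print(f"intra: {len(intra)} ({100 * len(intra) / total:.1f}%)")
print(f"inter: {inter_count} ({100 * inter_count / total:.1f}%)")
print(dict(linksets))
```

The linkset sizes produced here are the quantities that the dataset graph (top layer) works with, while the intra-dataset links stay in the entity layer.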

Outline: The DING Model
Overview; Unsupervised Link Weighting; Computing DatasetRank; Computing Local EntityRank; Combining Dataset Rank and Entity Rank
11 / 36

The DING Model: Overview
DING Principles: DING performs entity ranking in three steps:
1. Dataset ranks are computed by performing link analysis on the top layer (i.e., the dataset graph).
2. For each dataset, entity ranks are computed by performing link analysis on the local entity collection.
3. The popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.
12 / 36


Unsupervised Link Weighting
Intuition: TF-IDF applied to link labels, i.e. Link Frequency - Inverse Dataset Frequency (LF-IDF).
Link weighting factor $w_{\sigma,i,j}$: assign a low weight to very common links, such as rdfs:seeAlso.
$$ w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma) = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + freq(\sigma)} $$
13 / 36
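A minimal sketch of the LF-IDF weighting follows. It assumes that N is the total number of datasets and that freq(σ) counts the datasets with at least one outgoing linkset labelled σ; the slide does not spell out either definition, so both are assumptions. The function name `lf_idf` and the toy linksets are hypothetical.

```python
import math
from collections import defaultdict

def lf_idf(linksets, n_datasets):
    """Compute LF-IDF weights.

    linksets maps (label, source_dataset, target_dataset) -> number of links.
    n_datasets is N, assumed here to be the total number of datasets.
    """
    # Denominator of LF: all links leaving each source dataset.
    out_total = defaultdict(int)
    # Assumed definition of freq(label): datasets using the label on outgoing links.
    datasets_using = defaultdict(set)
    for (label, src, _dst), size in linksets.items():
        out_total[src] += size
        datasets_using[label].add(src)

    weights = {}
    for (label, src, dst), size in linksets.items():
        lf = size / out_total[src]
        idf = math.log(n_datasets / (1 + len(datasets_using[label])))
        weights[(label, src, dst)] = lf * idf
    return weights

# Toy linksets: "seeAlso" is used by three datasets, "author" by only one.
linksets = {
    ("seeAlso", "A", "B"): 50,
    ("author", "A", "C"): 50,
    ("seeAlso", "B", "C"): 10,
    ("seeAlso", "C", "A"): 5,
}
weights = lf_idf(linksets, n_datasets=10)
for key, w in sorted(weights.items()):
    print(key, round(w, 3))
```

In this toy setting the two linksets leaving dataset A have the same size, so the rarer label ("author") ends up with the higher weight, which is the behaviour the slide describes for common labels such as rdfs:seeAlso.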



Computing Dataset Rank
Assumption: Dataset surfing behaviour is the same as the web page surfing behaviour in PageRank.
DatasetRank: Weighted PageRank on the weighted dataset graph.
The distribution factor $w_{\sigma,i,j}$ is defined by LF-IDF.
The probability of a random jump is proportional to the size of a dataset.
$$ r^{k}(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i) \, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \subset G} |E_D|} $$
19 / 36
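The DatasetRank iteration can be sketched as a plain power iteration. This is an illustration under assumptions, not the authors' implementation: the damping value `alpha=0.85`, the fixed iteration count, the function name `dataset_rank`, and the toy weights are all hypothetical. The sketch applies the update literally and does not treat dangling datasets specially, so ranks only sum to 1 when the outgoing weights of every dataset sum to 1, as they do in the toy example.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=50):
    """Weighted PageRank over the dataset graph.

    weights maps (label, source_dataset, target_dataset) -> link weight.
    sizes maps dataset -> number of entities it contains.
    """
    total_entities = sum(sizes.values())
    # Random-jump probability proportional to dataset size.
    jump = {d: sizes[d] / total_entities for d in sizes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {d: (1 - alpha) * jump[d] for d in sizes}
        for (_label, src, dst), w in weights.items():
            # Each linkset passes a weighted share of the source rank to its target.
            new_rank[dst] += alpha * rank[src] * w
        rank = new_rank
    return rank

# Toy dataset graph; weights leaving each dataset sum to 1.
sizes = {"A": 100, "B": 50, "C": 50}
weights = {
    ("p", "A", "B"): 1.0,
    ("p", "B", "A"): 0.5,
    ("q", "B", "C"): 0.5,
    ("p", "C", "A"): 1.0,
}
rank = dataset_rank(weights, sizes)
for name in sorted(rank):
    print(name, round(rank[name], 4))
```

Because the dataset graph is small compared with the entity graph, each pass over `weights` touches only linksets rather than individual entity links.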

Computing Local EntityRank
Generic Algorithms:
Weighted EntityRank: Weighted PageRank applied to the internal entities and intra-links of a dataset.
Weighted LinkCount: in-degree link counting applied to the internal entities and intra-links of a dataset.
20 / 36
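As a rough illustration of the simpler of the two algorithms, the sketch below counts in-links among the internal entities of one dataset. It is unweighted for brevity, and normalising the counts so that they sum to 1 is an assumption made here so that the result can be read as a local rank; the helper name `local_link_count` and the toy data are hypothetical.

```python
from collections import Counter

def local_link_count(links, entity_dataset, dataset):
    """In-degree of each internal entity, counting intra-dataset links only."""
    counts = Counter({e: 0 for e, d in entity_dataset.items() if d == dataset})
    for src, dst in links:
        # Skip inter-dataset links: both ends must belong to the dataset.
        if entity_dataset.get(src) == dataset and entity_dataset.get(dst) == dataset:
            counts[dst] += 1
    total = sum(counts.values())
    # Assumed normalisation: divide by the number of intra-dataset links.
    return {e: (c / total if total else 0.0) for e, c in counts.items()}

entity_dataset = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
links = [
    ("a1", "a2"),
    ("a3", "a2"),
    ("a2", "a1"),
    ("b1", "a2"),  # inter-dataset link, ignored locally
    ("a1", "b1"),  # inter-dataset link, ignored locally
]
local_rank = local_link_count(links, entity_dataset, "A")
print(local_rank)
```

A single pass over the links is enough here, whereas an EntityRank-style computation would repeat a PageRank update over the same intra-dataset links until convergence.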

Combining Dataset Rank and Entity Rank
Naive approach: purely probabilistic point of view, i.e. the joint probability under the assumption of independent events.
Global score: $$ r^{g}(e) = P(e \cap D) = r(e) \times r(D) $$
Problem: favours smaller datasets.
DING approach: add a local entity rank factor; normalise local ranks to the same average based on dataset size.
$$ r^{g}(e) = r(D) \times r(e) \times \frac{|E_D|}{\sum_{D' \subset G} |E_{D'}|} $$
21 / 36
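The effect of the size normalisation can be checked with a small worked example. The numbers below are made up for illustration: two datasets with the same dataset rank, one with 10 entities and one with 1000, each with uniform local ranks. The function names are hypothetical.

```python
import math

def naive_global_rank(dataset_rank, local_rank):
    """Joint probability under the independence assumption."""
    return dataset_rank * local_rank

def ding_global_rank(dataset_rank, local_rank, dataset_size, total_entities):
    """Naive score scaled by the dataset's share of all entities."""
    return dataset_rank * local_rank * dataset_size / total_entities

total_entities = 10 + 1000

# Small dataset: 10 entities, uniform local rank 1/10.
naive_small = naive_global_rank(0.5, 1 / 10)
ding_small = ding_global_rank(0.5, 1 / 10, 10, total_entities)

# Large dataset: 1000 entities, uniform local rank 1/1000.
naive_large = naive_global_rank(0.5, 1 / 1000)
ding_large = ding_global_rank(0.5, 1 / 1000, 1000, total_entities)

print("naive:", naive_small, naive_large)
print("DING: ", ding_small, ding_large)
```

With the naive product, an average entity of the small dataset scores 100 times higher than an average entity of the large one, purely because local ranks are spread over fewer entities. With the size factor, the two average entities receive the same score.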

Outline: Experimental Results
Overview; User Study; SemSearch 2010
22 / 36

Experimental Results: Overview
Link Analysis Methods: Global EntityRank (GER); Local LinkCount (LLC) and Local EntityRank (LER); local algorithms combined with DatasetRank (DR-LLC and DR-LER).
Experiments:
1. User study to qualitatively evaluate each method.
2. Semantic Search challenge.
23 / 36

User Study: Design
Exp-A: Local entity ranking (LER & LLC) on the DBpedia dataset; 31 participants.
Exp-B: DING (DR-LER & DR-LLC) on Sindice's page repository; 58 participants.
Task: 10 queries (keyword and SPARQL queries); one result list (top-10) per algorithm; rate algorithms (Worse, Slightly Worse, Similar, Slightly Better, Better; abbreviated W, SW, S, SB, B) in relation to GER.
24 / 36

User Study: Questionnaire
Figure: One of the questionnaires given to the participants
25 / 36

User Study A: Results
(a) LER
Rate      O_i    E_i    %χ²
B         0      6.2    -13%
SB        7      6.2    +0%
S         21     6.2    +71%
SW        3      6.2    -3%
W         0      6.2    -13%
Totals    31     31
(b) LLC
Rate      O_i    E_i    %χ²
B         3      6.2    -12%
SB        8      6.2    +4%
S         13     6.2    +53%
SW        6      6.2    -0%
W         1      6.2    -31%
Totals    31     31
Table: Chi-square test for Exp-A. The column %χ² gives, for each modality, its contribution to χ² (in relative value).
Conclusion: LER and LLC provide results similar to those of GER. However, a larger proportion of the participants consider LER similar to GER.
26 / 36

User Study B: Results
(a) DR-LER
Rate      O_i    E_i     %χ²
B         12     11.6    +0%
SB        12     11.6    +0%
S         22     11.6    +57%
SW        9      11.6    -4%
W         3      11.6    -39%
Totals    58     58
(b) DR-LLC
Rate      O_i    E_i     %χ²
B         7      11.6    -9%
SB        24     11.6    +65%
S         13     11.6    +1%
SW        10     11.6    -1%
W         4      11.6    -24%
Totals    58     58
Table: Chi-square test for Exp-B. The column %χ² gives, for each modality, its contribution to χ² (in relative value).
Conclusion: DR-LLC appears to be more effective. A large proportion of the participants find it slightly better than GER, and this is reinforced by the small number of participants who find it worse.
27 / 36
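The %χ² columns can be recomputed from the observed counts. The sketch below assumes a uniform expected distribution over the five ratings, which is what the E_i columns show (31/5 = 6.2 and 58/5 = 11.6), and signs each contribution by whether the observed count is above or below the expected one. The function name `chi_square_contributions` is hypothetical.

```python
def chi_square_contributions(observed):
    """Return (expected count, chi-square statistic, signed % contributions)."""
    n = sum(observed.values())
    expected = n / len(observed)  # uniform expectation over the ratings
    terms = {k: (o - expected) ** 2 / expected for k, o in observed.items()}
    chi2 = sum(terms.values())
    signed = {
        k: (1 if observed[k] >= expected else -1) * 100 * terms[k] / chi2
        for k in observed
    }
    return expected, chi2, signed

# Observed counts for LER in Exp-A and DR-LLC in Exp-B, as in the tables.
ler = {"B": 0, "SB": 7, "S": 21, "SW": 3, "W": 0}
dr_llc = {"B": 7, "SB": 24, "S": 13, "SW": 10, "W": 4}

for name, observed in (("LER", ler), ("DR-LLC", dr_llc)):
    expected, chi2, signed = chi_square_contributions(observed)
    print(name, "E_i =", round(expected, 1), "chi2 =", round(chi2, 2))
    for rating, pct in signed.items():
        print(f"  {rating:>2}: {pct:+.0f}%")
```

For LER the "Similar" rating dominates the statistic, and for DR-LLC the "Slightly Better" rating does, which matches the conclusions drawn on the two result slides.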

SemSearch 2010: Entity Search Track
SemSearch 2010: first semantic search evaluation; focus on entity search.
Experiment Design: Billion Triple Challenge 2009 dataset; 92 keyword queries; relevance judgement on the top 10 entities.
28 / 36

SemSearch 2010: Experiment Results Figure: SemSearch 2010 evaluation results 29 / 36

Scalability: Computing Dataset Rank
Graph       Node    Edge
Web Data    60M     364M
Dataset     50K     1.2M
Table: Graph size
DatasetRank: 1 iteration takes 200 ms; good-quality rank in a few seconds.
30 / 36

Scalability: Dataset Size Distribution
Power-law distribution; the majority of the datasets contain fewer than 1,000 nodes.
31 / 36

Scalability: Computing Entity Rank
EntityRank: 55 iterations of 1 minute each (for the DBpedia dataset).
LinkCount: requires only 1 iteration; can be computed on the fly with an appropriate data index.
32 / 36

Dataset-Dependent Local EntityRank
Dataset-Specific Algorithms: There is no reason to have one generic algorithm for all datasets; an appropriate entity ranking algorithm could be chosen for each dataset.
Graph Structure        Dataset                 Algorithm
Generic, Controlled    DBpedia                 LinkCount
Generic, Open          Social Communities      EntityRank
Hierarchical           Geonames, Taxonomies    DHC
Bipartite              DBLP                    CiteRank
Table: List of various graph structures with appropriate algorithms
33 / 36


Conclusion
DING Method: hierarchical link analysis for web data; quality comparable to or even better than standard approaches; lower computational complexity; dataset-dependent local entity ranking.
Future Work: investigate how to detect the appropriate local entity ranking method for a dataset; study query-dependent ranking and how it can be combined with DING ranking.
36 / 36