Hierarchical Link Analysis for Ranking Web Data

Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker. Digital Enterprise Research Institute, Galway. June 1, 2010.

Introduction: Web of Data. The number of web data sources keeps growing: the Linked Open Data cloud, the Open Graph protocol, e-commerce (GoodRelations), e-government, ... How can relevant information be searched and retrieved? A single query can return millions of entities, and users expect only the most relevant ones. Web data search engines (e.g., Sindice) need an effective way to rank entities. Partial solution: popularity-based entity ranking.

Link Analysis on the Web. Link analysis: given a directed graph, determine the popularity of its nodes using link information; a link from a node i to a node j is taken as evidence of the importance of node j. Link analysis for web documents: PageRank considers exclusively the link structure, whereas Hierarchical Link Analysis considers both the link structure and the hierarchical structure. Link analysis for web data: current approaches consider exclusively the link structure. Sindice: dataset/entity-centric view.

Outline: Web Data Model. Web Data Graph; Dataset Graph; Internal and External Node; Intra and Inter-Dataset Edge; Linkset; Two-Layer Model; Quantifying the Two-Layer Model.

Web Data Graph. Figure: Web data graph.

Dataset Graph. Figure: Dataset graph.

Internal and External Node. Figure: Internal (red) and external (blue) nodes.

Intra and Inter-Dataset Edge. Figure: Inter-dataset (orange) and intra-dataset (black) edges.

Linkset. Figure: Linkset.

Two-Layer Model. Figure: Two-layer model of the Web of Data.

Quantifying the Two-Layer Model

Datasets:
DBpedia: 17.7 million entities
Citeseer (RKBExplorer): 2.48 million entities
Geonames: 13.8 million entities
Sindice: 60 million entities among 50,000 datasets

Dataset     Intra            Inter
DBpedia     88M (93.2%)      6.4M (6.8%)
Citeseer    12.9M (77.7%)    3.7M (22.3%)
Geonames    59M (98.3%)      1M (1.7%)
Sindice     287M (78.8%)     77M (21.2%)

Table: Ratio of intra-dataset to inter-dataset links
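
The intra/inter split in the table can be obtained by checking, for every edge, whether both endpoints belong to the same dataset. The following is a minimal sketch under assumptions: the function name `intra_inter_ratio`, the `dataset_of` mapping, and the toy entities are hypothetical and do not come from the slides.

```python
from collections import Counter

def intra_inter_ratio(edges, dataset_of):
    """Count intra- and inter-dataset edges and their share of all edges.

    edges: iterable of (source_entity, target_entity) pairs.
    dataset_of: dict mapping each entity to the dataset that contains it
                (hypothetical input; how entities are assigned to datasets
                is not specified on the slide).
    """
    counts = Counter()
    for source, target in edges:
        same_dataset = dataset_of[source] == dataset_of[target]
        counts["intra" if same_dataset else "inter"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"intra": (0, 0.0), "inter": (0, 0.0)}
    return {kind: (counts[kind], counts[kind] / total)
            for kind in ("intra", "inter")}

# Toy graph: two datasets, A and B.
dataset_of = {"a1": "A", "a2": "A", "b1": "B"}
edges = [("a1", "a2"), ("a2", "a1"), ("a1", "b1"), ("b1", "a2")]
ratio = intra_inter_ratio(edges, dataset_of)
print(ratio)
```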

Outline: The DING Model. Overview; Unsupervised Link Weighting; Computing DatasetRank; Computing Local EntityRank; Combining Dataset Rank and Entity Rank.

The DING Model: Overview. DING principles: DING performs entity ranking in three steps: (1) dataset ranks are computed by performing link analysis on the top layer (i.e., the dataset graph); (2) for each dataset, entity ranks are computed by performing link analysis on the local entity collection; (3) the popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.

Unsupervised Link Weighting. Intuition: TF-IDF applied to link labels, giving Link Frequency - Inverse Dataset Frequency (LF-IDF). The link weighting factor $w_{\sigma,i,j}$ assigns a low weight to very common links, such as rdfs:seeAlso:

$$w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma) = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + \mathit{freq}(\sigma)}$$
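
To make the LF-IDF weighting concrete, here is a minimal sketch under stated assumptions. The slide does not define N or freq(σ); the code assumes N is the number of datasets and freq(σ) is the number of source datasets in which label σ occurs. The function name `lf_idf`, the key layout `(label, source, target)`, and the toy linksets are hypothetical.

```python
import math
from collections import defaultdict

def lf_idf(linksets, num_datasets):
    """Compute w[(label, src, dst)] = LF * IDF for every linkset.

    linksets: dict mapping (label, src_dataset, dst_dataset) -> number of
              links in that linkset, i.e. |L_{sigma,i,j}|.
    num_datasets: N (assumed to be the total number of datasets).
    """
    out_total = defaultdict(int)        # sum of |L_{tau,i,k}| per source i
    label_sources = defaultdict(set)    # datasets using each label
    for (label, src, _dst), size in linksets.items():
        out_total[src] += size
        label_sources[label].add(src)

    weights = {}
    for (label, src, dst), size in linksets.items():
        lf = size / out_total[src]
        # Assumption: freq(sigma) = number of source datasets using the label.
        # Note that the IDF turns negative once 1 + freq exceeds N.
        idf = math.log(num_datasets / (1 + len(label_sources[label])))
        weights[(label, src, dst)] = lf * idf
    return weights

# Toy data: rdfs:seeAlso is used by three datasets, foaf:knows by one.
linksets = {
    ("rdfs:seeAlso", "A", "B"): 50,
    ("foaf:knows", "A", "C"): 50,
    ("rdfs:seeAlso", "B", "C"): 50,
    ("rdfs:seeAlso", "C", "A"): 10,
}
weights = lf_idf(linksets, num_datasets=10)
for key, value in sorted(weights.items()):
    print(key, round(value, 3))
```

With equal link frequencies out of dataset A, the widely used rdfs:seeAlso label receives the lower weight, which is the behaviour the slide describes.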

Computing Dataset Rank. Assumption: dataset surfing behaviour is the same as the web page surfing behaviour in PageRank. DatasetRank: weighted PageRank on the weighted dataset graph. The distribution factor $w_{\sigma,i,j}$ is defined by LF-IDF. The probability of a random jump is proportional to the size of a dataset.

$$r^{k}(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i)\, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \subset G} |E_D|}$$
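
The DatasetRank recurrence can be sketched as a plain power iteration. This is an illustrative sketch, not the authors' implementation: the function name `dataset_rank`, the damping value α = 0.85, the fixed iteration count, and the toy weights are assumptions. The update follows the formula literally, so the ranks sum to one only when the outgoing weights of each dataset sum to one, as they do in the toy data.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=50):
    """Iterate r^k(D_j) = alpha * sum_i r^{k-1}(D_i) * w + (1 - alpha) * jump_j.

    weights: dict mapping (label, src_dataset, dst_dataset) -> w_{sigma,i,j}.
    sizes: dict mapping dataset -> number of entities |E_D|.
    alpha: damping factor (0.85 is an assumed default, not from the slides).
    """
    total_size = sum(sizes.values())
    # Random jump probability proportional to dataset size.
    jump = {d: size / total_size for d, size in sizes.items()}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {d: (1 - alpha) * jump[d] for d in sizes}
        for (_label, src, dst), w in weights.items():
            new_rank[dst] += alpha * rank[src] * w
        rank = new_rank
    return rank

# Toy dataset graph with hand-picked weights (hypothetical values).
sizes = {"A": 100, "B": 50, "C": 50}
weights = {
    ("p", "A", "B"): 0.5,
    ("q", "A", "C"): 0.5,
    ("p", "B", "A"): 1.0,
    ("p", "C", "A"): 1.0,
}
ranks = dataset_rank(weights, sizes)
print({d: round(r, 3) for d, r in ranks.items()})
```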

Computing Local EntityRank. Generic algorithms. Weighted EntityRank: weighted PageRank applied to the internal entities and intra-dataset links of a dataset. Weighted LinkCount: weighted in-degree (incoming link count) computed over the internal entities and intra-dataset links of a dataset.
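
As an illustration of the simpler of the two local algorithms, the sketch below computes a weighted in-degree over the intra-dataset links of one dataset. The slide does not say how links are weighted or whether scores are normalised; the per-label weights, the normalisation to a sum of one, and the name `weighted_link_count` are assumptions made for this example.

```python
from collections import defaultdict

def weighted_link_count(intra_links, label_weight=None):
    """Score each internal entity by the weighted number of incoming links.

    intra_links: iterable of (source, label, target) triples whose endpoints
                 both belong to the same dataset.
    label_weight: optional dict mapping a link label to a weight
                  (assumed weighting scheme; unknown labels count as 1.0).
    """
    label_weight = label_weight or {}
    scores = defaultdict(float)
    for source, label, target in intra_links:
        scores[source] += 0.0                      # register the source entity
        scores[target] += label_weight.get(label, 1.0)
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    # Assumed normalisation so that local scores sum to one.
    return {entity: score / total for entity, score in scores.items()}

links = [("a", "p", "b"), ("c", "p", "b"), ("b", "q", "a")]
local = weighted_link_count(links, label_weight={"q": 0.5})
print(local)
```

A single pass over the links suffices, which is consistent with the deck's later remark that LinkCount needs only one iteration.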

Combining Dataset Rank and Entity Rank. Naive approach: from a purely probabilistic point of view, the global score is a joint probability; assuming independent events, $r_g(e) = P(e \cap D) = r(e) \times r(D)$. Problem: this favours smaller datasets. DING approach: add a local entity rank factor; normalise local ranks to the same average based on dataset size:

$$r_g(e) = r(D) \times r(e) \times \frac{|E_D|}{\sum_{D' \subset G} |E_{D'}|}$$
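
A small worked example shows why the naive product favours small datasets and how the size factor corrects it. The numbers are made up for illustration: two datasets with the same dataset rank, one with 10 entities and one with 1000, each with uniform local ranks. The function names are hypothetical.

```python
import math

def naive_global_rank(dataset_rank, local_rank):
    """Naive joint probability: r_g(e) = r(e) * r(D)."""
    return dataset_rank * local_rank

def ding_global_rank(dataset_rank, local_rank, dataset_size, total_size):
    """DING combination: r_g(e) = r(D) * r(e) * |E_D| / sum of all |E_D'|."""
    return dataset_rank * local_rank * dataset_size / total_size

# Two equally popular datasets of very different sizes (toy values).
small_size, large_size = 10, 1000
total_size = small_size + large_size
rank_d = 0.5
# Uniform local ranks: each entity gets 1 / |E_D|.
small_local, large_local = 1 / small_size, 1 / large_size

naive_small = naive_global_rank(rank_d, small_local)
naive_large = naive_global_rank(rank_d, large_local)
ding_small = ding_global_rank(rank_d, small_local, small_size, total_size)
ding_large = ding_global_rank(rank_d, large_local, large_size, total_size)

print("naive:", naive_small, naive_large)
print("DING: ", ding_small, ding_large)
```

Under the naive product an average entity of the small dataset scores 100 times higher than an average entity of the large one, while the size-normalised score treats the two alike.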

Outline: Experimental Results. Overview; User Study; SemSearch 2010.

Experimental Results: Overview. Link analysis methods: Global EntityRank (GER); Local LinkCount (LLC) and Local EntityRank (LER); local algorithms combined with DatasetRank (DR-LLC and DR-LER). Experiments: (1) a user study to evaluate each method qualitatively; (2) the Semantic Search challenge.

User Study: Design. Exp-A: local entity ranking (LER & LLC) on the DBpedia dataset, 31 participants. Exp-B: DING (DR-LER & DR-LLC) on Sindice's page repository, 58 participants. Both experiments: 10 queries (keyword and SPARQL queries); one result list (top-10) per algorithm; participants rate each algorithm relative to GER as Worse (W), Slightly Worse (SW), Similar (S), Slightly Better (SB), or Better (B).

User Study: Questionnaire. Figure: One of the questionnaires given to the participants.

User Study A: Results

(a) LER
Rate      O_i    E_i    %χ²
B          0     6.2    -13%
SB         7     6.2     +0%
S         21     6.2    +71%
SW         3     6.2     -3%
W          0     6.2    -13%
Totals    31     31

(b) LLC
Rate      O_i    E_i    %χ²
B          3     6.2    -12%
SB         8     6.2     +4%
S         13     6.2    +53%
SW         6     6.2      0%
W          1     6.2    -31%
Totals    31     31

Table: Chi-square test for Exp-A. The column %χ² gives, for each modality, its contribution to χ² (in relative value).

Conclusion: LER and LLC provide results similar to GER. However, a noticeably larger proportion of participants rated LER as similar to GER (21 of 31, against 13 of 31 for LLC).
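
The %χ² column can be reproduced from the observed counts. The sketch below assumes a uniform expected count (31 / 5 = 6.2 per rating, as in the E_i column) and assumes that the sign of each contribution follows the sign of O_i - E_i; the function name `chi_square_contributions` is hypothetical.

```python
def chi_square_contributions(observed):
    """Return chi-square and the signed relative contribution of each cell.

    observed: list of observed counts O_i, one per rating.
    The expected count E_i is assumed uniform: sum(observed) / len(observed).
    """
    expected = sum(observed) / len(observed)
    terms = [(o - expected) ** 2 / expected for o in observed]
    chi2 = sum(terms)
    # Sign convention (assumed): positive when O_i exceeds E_i.
    signed_pct = [
        (1 if o >= expected else -1) * 100 * term / chi2
        for o, term in zip(observed, terms)
    ]
    return chi2, signed_pct

# Observed counts for LER in Exp-A, in the order B, SB, S, SW, W.
ler_observed = [0, 7, 21, 3, 0]
chi2, pct = chi_square_contributions(ler_observed)
rounded = [round(p) for p in pct]
print(round(chi2, 2), rounded)
```

For the LER counts this yields contributions of about 13%, 0%, 71%, 3% and 13% in magnitude, matching the table.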

User Study B: Results

(a) DR-LER
Rate      O_i    E_i     %χ²
B         12     11.6     +0%
SB        12     11.6     +0%
S         22     11.6    +57%
SW         9     11.6     -4%
W          3     11.6    -39%
Totals    58     58

(b) DR-LLC
Rate      O_i    E_i     %χ²
B          7     11.6     -9%
SB        24     11.6    +65%
S         13     11.6     +1%
SW        10     11.6     -1%
W          4     11.6    -24%
Totals    58     58

Table: Chi-square test for Exp-B. The column %χ² gives, for each modality, its contribution to χ² (in relative value).

Conclusion: DR-LLC appears to be the more effective method. A large proportion of participants (24 of 58) find it slightly better than GER, and this is reinforced by the small number of participants who find it worse.

SemSearch 2010: Entity Search Track. SemSearch 2010: first semantic search evaluation; focus on entity search. Experiment design: Billion Triple Challenge 2009 dataset; 92 keyword queries; relevance judgement on the top 10 entities.

SemSearch 2010: Experiment Results. Figure: SemSearch 2010 evaluation results.

Scalability: Computing Dataset Rank

Graph       Nodes    Edges
Web Data    60M      364M
Dataset     50K      1.2M

Table: Graph size

DatasetRank: one iteration takes 200 ms; a good-quality rank is obtained in a few seconds.

Scalability: Dataset Size Distribution. Power-law distribution; the majority of the datasets contain fewer than 1000 nodes.

Scalability: Computing Entity Rank. EntityRank: 55 iterations of 1 minute each (for the DBpedia dataset). LinkCount: requires only 1 iteration; can be computed on the fly with an appropriate data index.

Dataset-Dependent Local EntityRank. Dataset-specific algorithms: there is no reason to use one generic algorithm for all datasets; an appropriate entity ranking algorithm could be chosen for each dataset.

Graph Structure        Dataset                  Algorithm
Generic, Controlled    DBpedia                  LinkCount
Generic, Open          Social Communities       EntityRank
Hierarchical           Geonames, Taxonomies     DHC
Bipartite              DBLP                     CiteRank

Table: List of various graph structures with appropriate algorithms

Conclusion. DING method: hierarchical link analysis for web data; quality comparable to or even better than standard approaches; lower computational complexity; dataset-dependent local entity ranking. Future work: investigate how to detect the appropriate local entity ranking method for a dataset; study query-dependent ranking and how it can be combined with DING ranking.