Fast Inbound Top-K Query for Random Walk with Restart


Fast Inbound Top-K Query for Random Walk with Restart. Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, Jiawei Han. University of Illinois at Urbana-Champaign. czhang82@illinois.edu

Outline: Background; Our Methods; Experiments; Summary

Background. Random walk with restart (RWR) is a very popular graph proximity measure. It is widely used in various applications: link prediction, Web search, graph clustering, and many others.

Random Walk with Restart. The RWR process: a user starts a random walk from node v. At each step, the user jumps back to v with probability c (0 < c < 1) and continues walking with probability 1 - c. The RWR from v to any node u is the probability that the user resides on u after infinitely many steps. [Figure: example with nodes v and u, labeled P=0.4, P=0.6, 0.6, and 0.4.] RWR captures the holistic graph structure and is robust to noise.
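The process described on this slide can be sketched as a short power iteration. This is a minimal illustration, not the authors' code: the function name `rwr_from`, the dense row-stochastic matrix `A` (where `A[i, j]` is the probability of stepping from i to j), and the 3-node graph are all assumptions made for the example.

```python
import numpy as np

def rwr_from(A, v, c, iters=200):
    """Stationary distribution of a walker that restarts at v with probability c.

    A[i, j] is the probability of stepping from node i to node j
    (each row is assumed to sum to 1).
    """
    n = A.shape[0]
    restart = np.zeros(n)
    restart[v] = 1.0
    p = restart.copy()
    for _ in range(iters):
        # With probability c jump back to v; otherwise take one step along an edge.
        p = c * restart + (1.0 - c) * (A.T @ p)
    return p

# Hypothetical 3-node graph with edges 0->1, 0->2, 1->2, 2->0.
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
scores = rwr_from(A, v=0, c=0.2)  # scores[u] is the RWR from node 0 to node u
```

The entries of `scores` sum to 1, and the start node always keeps at least probability c because of the restart term.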

Top-K Query for RWR. Prior art: the outbound top-K query. Given a query node q, which K nodes have the largest RWR scores from q? Our work: the inbound top-K query. Given a query node q, which K nodes have the largest RWR scores to q?

A motivating example: why inbound? A traffic network: each node is a road intersection, and each edge is a road segment (the number denotes the traffic volume). From which nodes does traffic flow to q and cause traffic jams there? The inbound top-K query identifies the source nodes that have a large amount of information flowing to a query node.

Existing Methods are NOT Fast Enough. Can we use existing methods to answer inbound top-K queries? Existing techniques can compute the RWR for any node pair in O(n) time, but computing the RWRs from all n nodes to the query node q one pair at a time is time-consuming. Branch-and-bound methods have been proposed for outbound top-K queries, but their bounds are not applicable to inbound top-K queries.
[1] Fujiwara, Y., Nakatsuji, M., Onizuka, M., Kitsuregawa, M.: Fast and exact top-k search for random walk with restart. PVLDB, 2012.
[2] Gupta, M.S., Pathak, A., Chakrabarti, S.: Fast algorithms for top-k personalized pagerank queries. In: WWW, pp. 1225-1226 (2008).


Overview of Our Methods. We propose two methods for processing the inbound top-K query: Squeeze and Ripple. Highlights of the two methods: Squeeze is a global method, while Ripple is a local one. Both are guaranteed to obtain the correct top-K results, without any pre-computation.

The Decomposition Theorem. The RWR from u to q is a linear combination of the RWRs of u's out-neighbors, with an extra score if u = q:
$r_u(q) = (1 - c) \sum_{v \in O_u} p_{uv} \, r_v(q) + c \cdot \mathbb{1}[u = q]$
where $r_u(q)$ is the RWR score from u to q, $O_u$ is the set of u's out-neighbors, and $p_{uv}$ is the transition probability from u to neighbor v.
[1] Jeh, G., Widom, J.: Scaling personalized web search. In: WWW 2003.

The Squeeze Method. With the Decomposition Theorem, the RWR scores from all nodes to q are the solution to the following linear system:
$x_q = (1 - c) A x_q + c \, e_q$
where $x_q$ is the RWR vector, with $x_q(u)$ the RWR from u to q; $A$ is the transition matrix; and $e_q$ is a length-n unit vector whose element for q is 1.
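As a small worked example, the linear system can be solved directly on a toy graph. The sketch assumes the system has the form x = (1 - c) A x + c e_q with a row-stochastic A, which is what the Decomposition Theorem gives when c is the restart probability. The name `inbound_rwr_exact` and the 3-node graph are illustrative choices, not taken from the slides.

```python
import numpy as np

def inbound_rwr_exact(A, q, c):
    """Solve (I - (1 - c) A) x = c e_q; x[u] is the RWR score from u to q.

    A[u, v] is the transition probability from u to v (assumed row-stochastic).
    """
    n = A.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    return np.linalg.solve(np.eye(n) - (1.0 - c) * A, c * e_q)

# Hypothetical 3-node graph with edges 0->1, 0->2, 1->2, 2->0.
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
x = inbound_rwr_exact(A, q=2, c=0.2)

# Inbound top-2 for q = 2: the two nodes with the largest scores *to* q.
top2 = np.argsort(-x)[:2]
```

On this graph the inbound top-2 for q = 2 is node 2 itself followed by node 1. A dense solve like this costs O(n^3), which is only practical for small graphs.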

The Squeeze Method. Directly solving the linear system has time complexity O(n^3), which is too expensive for large graphs. Key idea of Squeeze: to find the top-K results, it is unnecessary to obtain the exact RWR scores. If we have lower and upper bounds for each node, we can obtain the top-K results at a lower cost.

The Squeeze Method. Squeeze obtains the top-K results iteratively. It sets the lower bound vector to an initial value and then repeatedly updates it [formulas missing]. For each node u, every iteration yields a lower bound and an upper bound on its RWR score to q [formulas missing]. The gap between the lower bound and the upper bound shrinks exponentially with the number of iterations. When the gap is small enough to determine the top-K results, Squeeze terminates the iteration and returns them.
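The bound-and-stop idea can be sketched as follows. The exact initial vector and bound formulas from the slide are not available here, so this sketch substitutes one valid choice and should not be read as the authors' formulas. It assumes the system x = (1 - c) A x + c e_q with a row-stochastic A. Starting from the zero vector and iterating that equation gives a lower bound after every step, and after t steps every true score is at most the lower bound plus (1 - c)^t. The names `squeeze_topk`, `lower`, and `gap`, and the 5-node graph, are hypothetical.

```python
import numpy as np

def squeeze_topk(A, q, c, k, max_iters=10_000):
    """Return the inbound top-k nodes for q using iteratively tightened bounds."""
    n = A.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    lower = np.zeros(n)  # the zero vector is a valid lower bound
    gap = 1.0            # invariant: lower[u] <= true score <= lower[u] + gap
    for _ in range(max_iters):
        lower = (1.0 - c) * (A @ lower) + c * e_q
        gap *= (1.0 - c)  # the uniform slack shrinks geometrically
        order = np.argsort(-lower)
        # The top-k set is certain once the k-th lower bound is at least the
        # largest possible score of any node ranked below it.
        if k >= n or lower[order[k - 1]] >= lower[order[k]] + gap:
            break
    return order[:k], lower, gap

# Hypothetical 5-node graph; each row sums to 1.
A = np.array([[0.0, 0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.5, 0.5, 0.0]])
top, lower, gap = squeeze_topk(A, q=2, c=0.2, k=2)
```

The loop stops long before the scores converge. It only needs the K-th and (K+1)-th candidates to be separated by more than the remaining slack, which is where the saving over an exact solve comes from.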

The Ripple Method. Squeeze saves cost by computing lower and upper bounds for each node. Can we do better? Key idea of Ripple: RWR has locality, namely the nodes around q tend to have larger RWR scores. Ripple maintains a vicinity around q and computes RWRs only for the nodes in the vicinity; for the outside nodes, Ripple maintains a uniform upper bound.

The Ripple Method. Initialization: $x_q(q) = c$, $N_q = \{q\}$, $B_q = \{q\}$, where $N_q$ is the vicinity around q and $B_q$ is the boundary of the vicinity. The RWR for any node not in the vicinity is set to 0.

The Ripple Method. Iterative update: expand the vicinity from the large-score boundary nodes; propagate the RWR scores among the vicinity nodes; derive a uniform upper bound for the outside nodes. The lower and upper bounds for the inside nodes are refined as more nodes are included in the vicinity. The RWR upper bound for outside nodes is determined by the boundary node that has the largest score.

The Ripple Method. The vicinity expansion continues until the approximate scores in the vicinity are good enough to produce the top-K results. Ripple saves cost because it never accesses the nodes outside the vicinity and only keeps a uniform upper bound for them.
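The locality idea can be illustrated with a simplified sketch that does not reproduce Ripple's actual update rules. It makes three assumptions. First, the scores satisfy x = (1 - c) A x + c e_q with a row-stochastic A. Second, the vicinity is fixed as all nodes that reach q within `hops` steps, rather than grown adaptively from large-score boundary nodes. Third, the uniform outside bound is (1 - c) times the largest upper bound on a boundary node, which holds because a walk from outside must enter the vicinity through a boundary node. The name `local_bounds` and the 6-node path graph are hypothetical. For brevity the code indexes a dense matrix; a real implementation would use adjacency lists so that outside nodes are never visited.

```python
import numpy as np

def local_bounds(A, q, c, hops, sweeps=200):
    """Bounds on inbound RWR scores to q, computed only on a vicinity of q."""
    n = A.shape[0]
    inside, frontier = {q}, {q}
    for _ in range(hops):  # grow the vicinity along incoming edges
        frontier = {int(u) for v in frontier for u in np.nonzero(A[:, v])[0]} - inside
        inside |= frontier
    idx = sorted(inside)
    outside = [u for u in range(n) if u not in inside]
    A_in = A[np.ix_(idx, idx)]
    leak = A[np.ix_(idx, outside)].sum(axis=1)  # chance of stepping out of the vicinity
    # Boundary: vicinity nodes that at least one outside node points to.
    on_boundary = A[np.ix_(outside, idx)].sum(axis=0) > 0
    e = np.zeros(len(idx))
    e[idx.index(q)] = c
    lower = np.zeros(len(idx))  # treats every outside score as 0
    upper = np.ones(len(idx))   # scores are probabilities, so 1 is a safe start
    outside_upper = 1.0         # uniform bound shared by all outside nodes
    for _ in range(sweeps):
        lower = e + (1 - c) * (A_in @ lower)
        upper = np.minimum(upper, e + (1 - c) * (A_in @ upper + leak * outside_upper))
        best = upper[on_boundary].max() if on_boundary.any() else 0.0
        outside_upper = min(outside_upper, (1 - c) * best)
    return idx, lower, upper, outside_upper

# Hypothetical 6-node path graph 0 - 1 - 2 - 3 - 4 - 5 with q = 0.
A = np.zeros((6, 6))
A[0, 1] = 1.0
A[5, 4] = 1.0
for u in range(1, 5):
    A[u, u - 1] = 0.5
    A[u, u + 1] = 0.5
idx, lower, upper, outside_upper = local_bounds(A, q=0, c=0.2, hops=2)
```

With `hops=2` the vicinity is nodes 0, 1, and 2, and nodes 3 to 5 share the single bound `outside_upper`. The top-K answer can be returned once K vicinity nodes have lower bounds above both `outside_upper` and the upper bounds of the remaining vicinity nodes; otherwise the vicinity has to grow.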


Experiments. Data sets: Wiki: ~4 million nodes and ~100 million edges. Each node is a Wikipedia page, and each edge is a link from one page to another. Foursquare: ~50k nodes and ~120k edges. Each node is a Foursquare venue, and each edge is people's transition from one place to another. Compared method: LU, a state-of-the-art method for RWR computation. It computes the RWR scores to q for all the nodes and then selects the top-K nodes.

Inbound Top-K vs. Outbound Top-K. Illustrative examples on Foursquare: outbound top-K queries tend to retrieve famous venues; inbound top-K queries retrieve less famous but highly correlated venues for the query.

Inbound Top-K vs. Outbound Top-K. Illustrative examples on Wikipedia: similarly, outbound top-K queries retrieve prestigious pages; inbound top-K queries retrieve more correlated pages.

Running Time. Varying K and c. Observations: 1. Both methods are much faster than the baseline. 2. Ripple is suitable for extremely large graphs.


Summary. We introduced a new top-K query based on random walk with restart. We proposed two efficient query processing methods. Squeeze performs power iteration to obtain approximate scores and the top-K results. Ripple uses locality to obtain the top-K results without accessing faraway nodes. The experiments show that the inbound top-K query retrieves interesting results and that both methods are much faster than the baseline.

Thank You!