Answering Label-Constraint Reachability in Large Graphs


Kun Xu, Peking University, Beijing, China, xukun@icst.pku.edu.cn
Lei Chen, Hong Kong Univ. of Sci. & Tech., Hong Kong, China, leichen@cse.ust.hk
Lei Zou, Peking University, Beijing, China, zoulei@cse.ust.hk
Yanghua Xiao, Fudan University, Shanghai, China, shawyh@fudan.edu.cn
Jeffrey Xu Yu, The Chinese Univ. of Hong Kong, Hong Kong, China, yu@se.cuhk.edu.hk
Dongyan Zhao, Peking University, Beijing, China, zdy@icst.pku.edu.cn

ABSTRACT
In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set $S$ and two vertices $u_1$ and $u_2$ in a large directed graph $G$, we verify whether there exists a path from $u_1$ to $u_2$ under label constraint $S$. Like traditional reachability queries, LCR queries are very useful in applications such as pathway finding in biological networks, inference over RDF (Resource Description Framework) graphs, and relationship finding in social networks. However, LCR queries are much more complicated than their traditional counterpart. Several techniques are proposed in this paper to minimize the search space in computing the path-label transitive closure. Furthermore, we demonstrate the superiority of our method by extensive experiments.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; H.2.8 [Database Management]: Database Applications - Graph Database

General Terms
Algorithm

Corresponding author: Lei Zou, zoulei@icst.pku.edu.cn. Lei Zou and Dongyan Zhao were supported by NSFC under Grant No.100009 and RFDP under Grant No. 01000011009. Jeffrey Xu Yu was supported by RGC of the Hong Kong SAR under Grant No. 19008 and 19109. Lei Chen's research is partially supported by HKUST SSRI11EG01 and NSFC No.00007. Yanghua Xiao was supported by the NSFC under Grant No.100001 and RFDP under Grant No. 0100071100.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'11, October 24-28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0717-8/11/10...$10.00.

1. INTRODUCTION
The growing popularity of graph databases has generated many interesting data management problems. One important type of query over graphs is the reachability query [1,,,,, ]. Specifically, given two vertices $u_1$ and $u_2$ in a directed graph $G$, we want to verify whether there exists a directed path from $u_1$ to $u_2$. Reachability queries have many applications, such as pathway finding in biological networks [7], inference over RDF (Resource Description Framework) graphs [8], and relationship finding in social networks [9].

There are two extreme solutions for answering reachability queries. One approach is to materialize the full transitive closure of $G$, enabling one to answer reachability queries efficiently. At the other extreme, one can perform DFS (depth-first search) or BFS (breadth-first search) over graph $G$ until reaching the target vertex, or until the search process cannot be continued. Obviously, neither method works in a large graph $G$: the former needs $O(|V|^2)$ space to store the transitive closure (large index space cost), and the latter needs $O(|V|)$ time to answer a reachability query (slow query response time). The key issue in reachability queries is how to find a good trade-off between these two basic solutions. Therefore, many algorithms have been proposed to address this issue, such as 2-hop [1, 10, 11], GRIPP [], path-cover [], tree-cover [, ], path-tree [] and 3-hop [].

In many real applications, edge labels are utilized to denote different relationships between two vertices. For example, edge labels in RDF graphs denote different properties. We can also use edge labels to define different relationships in social networks. In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries, which were originally proposed in [1].
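As a point of reference for the query-time extreme described above, the following is a minimal sketch of an index-free LCR check: a BFS that follows only edges whose label belongs to the constraint set. The function name `lcr_bfs`, the edge-list format, and the toy graph are assumptions made for illustration; they do not come from the paper.

```python
from collections import deque

def lcr_bfs(edges, u1, u2, S):
    """Return True if u2 is reachable from u1 using only edges labeled in S.

    edges: iterable of (source, target, label) triples (hypothetical format).
    """
    adj = {}
    for src, dst, label in edges:
        adj.setdefault(src, []).append((dst, label))
    allowed = set(S)
    seen = {u1}
    queue = deque([u1])
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for w, label in adj.get(v, []):
            # Skip edges whose label violates the constraint.
            if label in allowed and w not in seen:
                seen.add(w)
                queue.append(w)
    return False

# Hypothetical toy graph: 1 -a-> 2 -b-> 3 and 1 -c-> 3.
toy_edges = [(1, 2, "a"), (2, 3, "b"), (1, 3, "c")]
```

On this toy graph, vertex 3 is reachable from vertex 1 under {a, b} or {c}, but not under {a} alone. Each query costs a full traversal, which is the slow-response extreme the text contrasts with materializing an index.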
Conceptually, an LCR query verifies whether some specified type of relationship holds between two vertices. Here, we give a motivating example to demonstrate the usefulness of LCR queries. Consider an inference example over the RDF Schema (RDFS for short) dataset in Figure 1. Assume that we want to verify whether freshman is a subclass of people. A traditional database system would simply answer no, since there exists no triple (freshman, rdfs:subclass, people). However, under RDFS semantics, rdfs:subclass is a transitive property. Therefore, given the two triples (freshman, rdfs:subclass, student) and (student, rdfs:subclass, people), we can infer (freshman, rdfs:subclass, people), so the correct answer is yes. Queries of this kind are called inference queries. Obviously, it is prohibitive to materialize all inferred facts, according to the RDFS reasoning rules, in a very large RDF dataset. In fact, some inference queries can be reduced to LCR queries over RDF graphs. For example, the above example can be transformed into the following LCR query: verify whether there exists a directed path from entry freshman to people in the RDF graph, where all edge labels along the path are rdfs:subclass.

Figure 1: Inference vs. LCR Query. (a) RDF Dataset; (b) RDF Graph.

Although LCR queries are quite useful, it is non-trivial to answer them over a large directed graph. Traditional indexing methods can only verify reachability without considering how the connection is made between two vertices [1]. Furthermore, the traditional transitive closure does not contain path labels. In order to answer LCR queries efficiently, we make the following contributions in this work:
1) For LCR queries, we propose a method to transform an edge-labeled directed graph into an augmented DAG by representing the maximal strongly connected components as bipartite graphs. Then, based on the augmented DAG, we propose a method to compute the transitive closure efficiently.
2) We re-define the distance of a path and propose a Dijkstra-like algorithm to compute the single-source path-label transitive closure. We prove the optimality of our algorithm.
3) Extensive experiments confirm that our method is faster than the existing ones by orders of magnitude. For example, given a random network satisfying the ER model with 100K vertices and 10K edges, the method in [1] consumes 77 hours for index building. However, given the same graph, our method needs only 0. hour for index building. Furthermore, our method works well on a very large RDF graph (the Yago dataset) having more than million vertices and million edges and 97 edge labels.

2. BACKGROUND
2.1 Problem Definition
Definition 2.1. A directed edge-labeled graph $G$ is denoted as $G = \{V, E, \Sigma, \lambda\}$, where (1) $V$ is a set of vertices, (2) $E \subseteq V \times V$ is a set of directed edges, (3) $\Sigma$ is a set of edge labels, and (4) the labeling function $\lambda$ defines the mapping $E \to \Sigma$.

Given a path $p$ from $u_1$ to $u_2$ in graph $G$, the path-label of $p$ is denoted as $L(p) = \bigcup_{e \in p} \lambda(e)$, where $\lambda(e)$ denotes $e$'s edge label. Given graph $G$ in Figure 2, the numbers inside vertices are vertex IDs that we introduce to simplify the description of the graph, and the letters beside edges are edge labels. Considering path p1 = (1,, ), the path-label of p1 is L(p1) = {c}.

Definition 2.2. Given two vertices $u_1$ and $u_2$ in graph $G$ and a label constraint (set) $S = \{l_1, ..., l_n\}$, we say that $u_1$ can reach $u_2$ under label constraint $S$ (denoted as $u_1 \xrightarrow{S} u_2$) if and only if there exists a path $p$ from $u_1$ to $u_2$ and $L(p) \subseteq S$.

Definition 2.3. (Problem Definition) Given two vertices $u_1$ and $u_2$ in graph $G$ and a label set $S = \{l_1, ..., l_n\}$, a label-constraint reachability (LCR) query verifies whether $u_1$ can reach $u_2$ under the label constraint $S$, denoted as $LCR(u_1, u_2, S, G)$.

For example, given two vertices 1 and in graph G in Figure 2 and label constraint S = {c}, it is easy to know that 1 can reach under label constraint S, i.e., LCR(1,, S, G) = true, since there exists a path p1 = {1,,, }, where $L(p_1) \subseteq S$. If S = {c}, query LCR(1,, S, G) returns false.

Figure 2: A Running Example

Definition 2.4. Given two vertices $u_1$ and $u_2$ in graph $G$, $P(u_1, u_2)$ denotes all paths from $u_1$ to $u_2$. The path-label set from $u_1$ to $u_2$ is defined as $LS(u_1, u_2) = \{L(p) \mid p \in P(u_1, u_2)\}$.

Consider two paths (1,,,,, ) and (1,,, ) in P(1, ), where L(1,,,,, ) $\subseteq$ L(1,,, ). Obviously, if path (1,,, ) satisfies some label constraint $S$, path (1,,,,, ) also satisfies $S$. Therefore, path (1,,, ) is redundant (Definition 2.5) for any LCR query.

Definition 2.5. Consider two paths $p$ and $p'$ from vertex $u_1$ to $u_2$, respectively. If $L(p) \subseteq L(p')$, we say $L(p)$ covers $L(p')$.
In this case, $p'$ is a redundant path, and $L(p')$ is also redundant in the path-label set $LS(u_1, u_2)$.

Definition 2.6. The minimal path-label set from $u_1$ to $u_2$ in graph $G$ is defined as $M_G(u_1, u_2)$, where 1) $M_G(u_1, u_2) \subseteq LS(u_1, u_2)$; 2) there exists no redundant path-label in $M_G(u_1, u_2)$; and 3) the path-labels in $M_G(u_1, u_2)$ cover all path-labels in $LS(u_1, u_2)$.

Definition 2.7. Given two vertices $u_1$ and $u_2$ and a label constraint (set) $S$, we say $M_G(u_1, u_2)$ covers $S$ if and only if there exists a path $p$ from $u_1$ to $u_2$, where $S \supseteq L(p)$ and $L(p) \in M_G(u_1, u_2)$.

Definition 2.8. Given a graph $G$, the path-label transitive closure is a matrix $M_G = [M_G(u_1, u_2)]_{|V(G)| \times |V(G)|}$, where $u_1, u_2 \in V(G)$, and the single-source path-label transitive closure is a vector $M_G(u, -) = [M_G(u, u_i)]_{1 \times |V(G)|}$, where $u_i \in V(G)$.

When the context is clear, we also say transitive closure instead of path-label transitive closure for short. Furthermore, for ease of presentation, we borrow two operator definitions ($Prune$ and $\odot$) from reference [1]. $Prune(\cdot)$ is defined as $Prune(LS(u_1, u_2)) = M_G(u_1, u_2)$, which means removing all redundant paths. In Figure 2, Prune(LS(1,) = {, c}) = M_G(1, ) = {}, since.
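The cover relation and the Prune operator defined above can be sketched with Python frozensets. This is a minimal illustration under stated assumptions: the function names `covers`, `prune`, and `answers` and the sample labels are hypothetical, and the sample is not the Figure 2 example.

```python
def covers(labels_p, labels_q):
    """L(p) covers L(q) when L(p) is a subset of L(q)."""
    return frozenset(labels_p) <= frozenset(labels_q)

def prune(label_sets):
    """Keep only path-labels that are not strictly covered by another one."""
    unique = {frozenset(s) for s in label_sets}
    return {s for s in unique if not any(t < s for t in unique)}

def answers(minimal_sets, S):
    """LCR check against a minimal path-label set: some path-label must fit in S."""
    constraint = frozenset(S)
    return any(m <= constraint for m in minimal_sets)

# Hypothetical path-label set LS(u1, u2) with one redundant entry {a, c}.
sample_ls = [{"a"}, {"a", "c"}, {"b", "c"}]
sample_min = prune(sample_ls)
```

Here `sample_min` keeps {a} and {b, c}; {a, c} is dropped because {a} covers it. A query with S = {a, b} succeeds through {a}, while S = {c} fails because no remaining path-label fits inside it.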

$\odot(\cdot)$ is defined as follows: $\lambda(\overrightarrow{u_1u_2}) \odot M_G(u_2, u_3) = \{\lambda(\overrightarrow{u_1u_2}) \cup L(p_1), \lambda(\overrightarrow{u_1u_2}) \cup L(p_2), ...\}$, where $L(p_i) \in M_G(u_2, u_3)$ and $\lambda(\overrightarrow{u_1u_2})$ denotes the label of edge $\overrightarrow{u_1u_2}$. It is easy to prove that $M_G(u_1, u_3) = Prune(\bigcup_{u_2 \in V(G)} \{\lambda(\overrightarrow{u_1u_2}) \odot M_G(u_2, u_3)\})$. Analogously, we also define $\lambda(\overrightarrow{u_1u_2}) \odot M_G(u_2, -)$.

An extreme approach to answering LCR queries is to materialize the transitive closure $M_G$. At run time, given a query $LCR(u_1, u_2, S, G)$, the query can be answered by simply checking $M_G(u_1, u_2)$. However, computing $M_G$ is much more complicated than computing the traditional transitive closure. Before the formal discussion, we introduce the following theorem, which forms the basis of our algorithm and performance analysis.

Theorem 2.1. (Apriori Property) Given a path $p$, if one of its subpaths is redundant, $p$ must be redundant.

2.2 Existing Approaches
LCR queries are proposed in [1]. Generally speaking, the method in [1] employs a spanning tree $T$ and a partial transitive closure $NT$ to compress the full transitive closure. Specifically, a spanning tree $T$ is found in the graph $G$. Based on $T$, all pairwise paths are partitioned into three categories $P_n$, $P_s$ and $P_e$. $P_n$ contains all pairwise paths whose starting edges and end edges are both non-tree edges. $P_s$ (and $P_e$) contains all pairwise paths whose starting (and ending) edges are tree edges. In Figure 3, (,,,1) is a path in $P_n$, since , and , 1 are non-tree edges. $NT(u, v)$ contains all path labels between $u$ and $v$ in $P_n$. We can re-construct $M_G(u, v)$ by Equation 1. Therefore, we can re-construct the full transitive closure from the spanning tree $T$ and the partial transitive closure $NT = \{NT(u, v), u, v \in V(G)\}$.

$$M_G(u, v) = \{\{L(P_T(u, u'))\} \odot NT(u', v') \odot \{L(P_T(v', v))\} \mid u' \in Succ(u) \text{ and } v' \in Pred(v)\} \qquad (1)$$

where $u'$ is reachable from $u$ in the spanning tree $T$ and $L(P_T(u, u'))$ denotes the corresponding path label in $T$; and $v'$ can reach $v$ in the spanning tree $T$ and $L(P_T(v', v))$ denotes the corresponding path label in $T$.

Figure 3: An Example of Tree-Cover

Obviously, different spanning trees will lead to different $NT$.
In order to minimize the size of $NT$, the authors introduce a weight $w(e)$ for each edge $e$, where $w(e)$ reflects the number of path-labels that can be removed from $NT$ if $e$ is included in the spanning tree. Therefore, they propose to use the maximal spanning tree in $G$. However, it is quite expensive to assign exact edge weights $w(e)$. Thus, they propose a sampling method. For each sampling seed (vertex), they compute the single-source transitive closure, based on which they propose some heuristic methods to define edge weights.

However, the method in [1] has two limitations. First, similar to its counterpart methods for traditional reachability queries [, ], a single spanning tree cannot compress the transitive closure greatly, especially in dense graphs. Consequently, $NT$ may be very large. Second, in order to find the optimal spanning tree $T$, the authors propose an algorithm to compute the single-source transitive closure for each sampling seed (vertex). However, the search space of their algorithm is not minimal, i.e., it contains a large number of redundant paths, which affects the performance greatly. These two problems (large index size and an expensive index building process) affect the scalability of the method in [1]. Since computing the single-source transitive closure is also a building block in our method, we argue that our method is optimal in terms of search space. In order to understand the superiority of our method, in the full version of this paper [1], we analyze the algorithm in [1]. Due to the space limit, the details are omitted in this paper.

In [1], Fan et al. add regular expressions to graph reachability queries. Specifically, given two vertices $u_1$ and $u_2$, the method in [1] verifies whether there exists a directed path $P$ where all edge labels along the path satisfy the specified regular expression. Obviously, the LCR query is a special case of the problem in [1]. Fan et al. propose a bi-directional BFS algorithm at runtime. We can utilize the method in [1] for LCR queries. Given two vertices $u_1$ and $u_2$ over a large graph $G$ and a label set $S$, two sets are maintained for $u_1$ and $u_2$, respectively.
Each set records the vertices that are reachable from (resp. to) $u_1$ (resp. $u_2$) only via edges with labels in $S$. We expand the smaller set at a time until either the two sets intersect or they cannot be further expanded (i.e., unreachable). The problem with this method lies in its large search space. The search strategy in [1] differs from the traditional BFS algorithm, since one vertex may be visited multiple times in the bi-directional BFS algorithm [1].

3. COMPUTING TRANSITIVE CLOSURE
As mentioned earlier, compared with the traditional transitive closure, it is much more challenging to compute the path-label transitive closure (Definition 2.8). This section focuses on computing the path-label transitive closure efficiently. We first propose a Dijkstra-like algorithm to compute the single-source transitive closure efficiently (Section 3.1). Obviously, it is very expensive to iterate the single-source transitive closure computation from each vertex in $G$ to compute $M_G$. In order to address this issue, we propose an augmented DAG (DAG for short) by representing all strongly connected components as bipartite graphs. The DAG-based solution is discussed in Section 3.2.

3.1 Single-Source Transitive Closure
This subsection discusses how to compute the single-source transitive closure efficiently, since it is a building block in our DAG-based solution. As discussed in Section 2.2, the method in [1] is not optimal. The key problem is that some redundant paths are visited before their corresponding non-redundant paths. In order to address this issue, we propose a Dijkstra-like algorithm. As we know, in each step of Dijkstra's algorithm, we always access one un-visited vertex that has the minimal distance from the origin vertex. In our algorithm, we redefine distance. The distance of a path $p$ is defined as the number of distinct edge labels in $p$ instead of

the sum of edge weights. Then, according to this distance definition, we adopt the Dijkstra-like algorithm to compute the single-source transitive closure. This algorithm guarantees that all redundant paths are visited after their corresponding non-redundant paths. Therefore, all redundant paths can be pruned from the search space (Theorem 2.1).

Figure 4: Algorithm Process (Heap H at each step; Path Set RS)

Given graph $G$ in Figure 2, Figure 4 demonstrates how to compute $M_G(1, -)$ from vertex 1 in our algorithm (i.e., Algorithm 1). Initially, we set vertex 1 as the source. All of vertex 1's neighbors are put into the heap $H$. Each neighbor is denoted as a neighbor triple $[L(p), p, d]$, where $d$ denotes the neighbor's ID, $p$ specifies one path from source $s$ to $d$, and $L(p)$ is the path-label set of $p$. All neighbor triples are ranked according to the total order defined in Definition 3.1. Since [{}, (1, ), ] is the heap head (see Figure 4), it is moved to path set $RS$. When we move the heap head $T_1$ into path set $RS$, we check whether $T_1$ is covered (Definition 3.2) by some neighbor triple $T'$ in $RS$ (Line 5 in Algorithm 1). If so, we ignore $T_1$ (Lines 5-6); otherwise, we insert $T_1$ into $RS$ (Lines 7-8).

Definition 3.1. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$ in the heap $H$, $T_1 \leq T_2$ if and only if 1) $|L(p_1)| < |L(p_2)|$; or 2) $|L(p_1)| = |L(p_2)|$, in which case the order of $T_1$ and $T_2$ is arbitrarily defined.

Definition 3.2. Given a neighbor triple $T_1 = [L(p_1), p_1, d_1]$, $T_1$ is redundant if and only if there exists another neighbor triple $T_2 = [L(p_2), p_2, d_2]$, where $L(p_1) \supseteq L(p_2)$ and $d_1 = d_2$. In this case, we say that $T_1$ is covered by $T_2$.

Definition 3.3. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$, if $p_1$ is a parent path (or child path) of $p_2$, we say that $T_1$ (or $T_2$) is a parent neighbor triple (or child neighbor triple) of $T_2$ (or $T_1$).
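The Dijkstra-like procedure outlined above can be sketched as follows: a heap ordered by the number of distinct labels (Definition 3.1), with a triple discarded when an entry already in the path set covers it (Definition 3.2). This is a simplified illustration under stated assumptions, not the authors' implementation. It checks coverage lazily when a triple is popped rather than also deleting covered triples inside the heap, and it returns minimal path-labels per destination rather than a compressed path tree. The name `single_source_closure`, the adjacency format, and the toy graph are hypothetical.

```python
import heapq
from itertools import count

def single_source_closure(adj, source):
    """adj: {vertex: [(neighbor, label), ...]} (hypothetical format).

    Returns {destination: set of minimal path-labels (frozensets)}.
    """
    rs = {}          # path set RS, grouped by destination vertex
    heap = []        # entries: (number of labels, tie breaker, labels, path)
    tie = count()    # arbitrary order among equal-size label sets

    def push(labels, path):
        heapq.heappush(heap, (len(labels), next(tie), labels, path))

    for nbr, label in adj.get(source, []):
        if nbr != source:
            push(frozenset([label]), (source, nbr))

    while heap:
        _, _, labels, path = heapq.heappop(heap)
        dest = path[-1]
        found = rs.setdefault(dest, set())
        # Covered by a triple already in RS: redundant, so do not expand it.
        if any(prev <= labels for prev in found):
            continue
        found.add(labels)
        for nbr, label in adj.get(dest, []):
            if nbr in path:          # child path would be non-simple
                continue
            push(labels | {label}, path + (nbr,))
    return rs

# Hypothetical toy graph (not the paper's Figure 2).
toy_adj = {
    1: [(2, "a"), (3, "c")],
    2: [(3, "a"), (4, "b")],
    3: [(4, "a")],
    4: [(1, "a")],
}
toy_closure = single_source_closure(toy_adj, 1)
```

On the toy graph, vertex 4 is first reached through the path (1, 2, 3, 4) with the single label a, so the later triples for (1, 2, 4) with {a, b} and (1, 3, 4) with {a, c} are recognized as covered and never expanded.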
Then, we put all child neighbor triples (Definition 3.3) of [{}, (1, ), ] into heap $H$. Considering one neighbor of vertex, such as vertex, we put neighbor triple [{} $\cup$ L(, ) = {}, (1,, ), ] into $H$, where (1,, ) is (1, )'s child path. Analogously, we put [{} $\cup$ L(, ) = {c}, (1,, ), ] into $H$. When we insert some neighbor triple $T' = [L(p'), p', d']$ into $H$, we first check whether $p'$ is a non-simple path. If so, we ignore $T'$ (Lines 10-11). Furthermore, we also check whether there exists another triple $T''$ already in $H$ such that $T'$ is covered by $T''$, or $T'$ covers $T''$ (Lines 12-15). If $T'$ is covered by $T''$, we ignore $T'$; otherwise, $T'$ is inserted into $H$. If $T'$ covers some triple $T''$ in $H$, $T''$ is deleted from $H$.

At Step, the heap head is [{}, (1, ), ], which is moved to path set $RS$. Iteratively, we put all child neighbor triples of [{}, (1, ), ] into heap $H$. At Step, we find that [{c}, (1,, ), ] is covered by [{}, (1,,,, ), ]. Therefore, we remove [{c}, (1,, ), ] from $H$. Figure 4 illustrates the whole process. All paths and path-labels in $RS$ are non-redundant. According to $RS$, it is straightforward to obtain $M_G(1, -)$. Note that our algorithm stops the infection from a redundant path to its child paths (Theorem 2.1). For example, path (1,,, ) is pruned from the search space in our algorithm.

Algorithm 1 Single-Source Transitive Closure Computation
Require: Input: A graph $G$ and a vertex $u$ in $G$; Output: A compressed path tree C-PT($u$).
1: Set $u$ as the source. Set answer set $RS = \emptyset$ and heap $H = \emptyset$.
2: Put all neighbor triples of $u$ into $H$.
3: while $H \neq \emptyset$ do
4:   Let $T_1 = [L(p_1), p_1, d]$ denote the head in $H$.
5:   if $T_1$ is covered by some neighbor triple $T'$ in $RS$ then
6:     Delete $T_1$ from $H$
7:   else
8:     Move $T_1$ into $RS$
9:     for each child neighbor triple $T' = [L(p'), p', d']$ of $T_1$ do
10:      if $p'$ is a non-simple path then
11:        continue
12:      if $T'$ is not covered by some neighbor triple $T''$ in $H$ then
13:        Insert $T'$ into $H$
14:      if $T'$ covers some neighbor triple $T''$ in $H$ then
15:        Delete $T''$ from $H$
16: According to the paths in $RS$, build C-PT($u$).

Theorem 3.1.
Given a vertex $u$ in graph $G$, the following statements about Algorithm 1 hold:
1. (correctness) Any non-redundant path beginning from vertex $u$ can be found in path set $RS$ in Algorithm 1.
2. (optimality) Given any redundant path $p$, if one of $p$'s parent paths is also redundant, $p$ is pruned from the search space in Algorithm 1.

3.2 Computing Transitive Closures
Given a graph $G$, a straightforward method to compute the transitive closure $M_G$ is to repeat Algorithm 1 from each vertex in $G$. Obviously, it is inefficient to do that. Intuitively, given two adjacent vertices $u_1$ and $u_2$, most computations in Algorithm 1 for $u_1$ and $u_2$ are the same. Therefore, an efficient algorithm should exploit this property. Usually, a directed graph $G$ is transformed into a DAG by coalescing each strongly connected component into a single vertex in order to compute the transitive closure efficiently. However, this method cannot be used for LCR queries, since it may miss some edge labels. Instead, we propose an augmented DAG $D$ by representing all strongly connected components as bipartite graphs. Then, we can compute the single-source transitive closure $M(u, -)$ according to the reverse order of $D$. During the computation, $M(u, -)$ is always transmitted to its parent vertices in $D$. In this way, redundant computation can be avoided.

Specifically, we first identify all maximal connected components in graph $G$, denoted as $C_1, ..., C_m$. For each $C_i$ ($i = 1, ..., m$), we compute the local transitive closure in $C_i$ (denoted as $M_{C_i}$) by iterating Algorithm 1 for each vertex in $C_i$. For example, given graph $G$ in Figure 5, one maximal connected component $C_1$ is identified in $G$. We compute $M_{C_1}$ for $C_1$. Then, we represent each maximal connected component $C_i$ as a bipartite graph $B_i = (V_{i1}, V_{i2})$, where

$V_{i1}$ contains all in-portal vertices in $C_i$ and $V_{i2}$ contains all out-portal vertices in $C_i$. A vertex $u$ in $C_i$ is called an in-portal if and only if it has at least one incoming edge from vertices outside $C_i$. A vertex $u$ in $C_i$ is called an out-portal if and only if it has at least one outgoing edge to vertices outside $C_i$. If a vertex $u$ is both an in-portal and an out-portal, it has two instances $u$ and $u'$ that occur in $V_{i1}$ and $V_{i2}$, respectively. For any two vertices $u_1 \in V_{i1}$ and $u_2 \in V_{i2}$, we introduce a directed edge from $u_1$ to $u_2$, whose edge label is $M_{C_i}(u_1, u_2)$. For example, the bipartite graph $B_1$ corresponding to $C_1$ is given in Figure 5. Finally, we replace each maximal connected component $C_i$ by the corresponding bipartite graph $B_i$. In this way, we get graph $D$, as shown in Figure 5. We can prove that $D$ must be a DAG. In this paper, we call it an augmented DAG (DAG for short). Note that, given a vertex $u$ in $C_i$, if $u \notin D$, $u$ is called an intra vertex, such as vertices 1, and in Figure 5.

Figure 5: Augmented DAG (graph G, component C1, bipartite graph B1 with in-portal and out-portal vertices, DAG D)

Algorithm 2 lists the pseudocode to compute the transitive closure for DAG $D$.

Algorithm 2 DAG-Based Compute Local Transitive Closure
Require: Input: A graph $G$ and its corresponding DAG $D$; Output: $M_G$.
1: Identify all maximal connected components $C_i$.
2: Employ Algorithm 1 to compute the local transitive closure $M_{C_i}$ for each $C_i$.
3: Build DAG $D$ by replacing each $C_i$ by the bipartite graph $B_i$.
4: Perform topological sorting over $D$. Set $M_G(u, -) = \{\emptyset, ..., \emptyset\}$ for each vertex $u$.
5: for each vertex $u$ according to the reverse topological order do
6:   for each child $c_i$ of $u$ do
7:     $M_G(u, -) = Prune(M_G(u, -) \cup Prune(\lambda(\overrightarrow{uc_i}) \odot M_G(c_i, -)))$
8: for each maximal connected component $C_i$ do
9:   for each intra-vertex $u \in C_i$ do
10:    $M_G(u, -) = M_{C_i}(u, -)$
11:    for each out-portal $u_i$ in $V_{i2}$ do
12:      $M_G(u, -) = Prune(M_G(u, -) \cup Prune(M_{C_i}(u, u_i) \odot M_G(u_i, -)))$
13:  for each vertex $u' \notin C_i$ do
14:    for each intra-vertex $u \in C_i$ do
15:      for each in-portal $u_i$ in $V_{i1}$ do
16:        $M_G(u', u) = Prune(M_G(u', u) \cup Prune(M_G(u', u_i) \odot M_{C_i}(u_i, u)))$
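Lines 5-7 of Algorithm 2 can be illustrated with a small sketch that propagates path-label sets from children to parents in reverse topological order. It makes simplifying assumptions that are not from the paper: every DAG edge carries a single label (the set-valued labels on bipartite edges are not modeled), the direct edge to a child is added explicitly, and the names `dag_closure` and `prune` and the toy DAG are hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def prune(label_sets):
    """Keep only path-labels not strictly covered by another path-label."""
    unique = set(label_sets)
    return {s for s in unique if not any(t < s for t in unique)}

def dag_closure(dag):
    """dag: {u: [(child, label), ...]}. Returns closure[u][v] = minimal path-labels."""
    # Declaring children as prerequisites makes static_order() emit them first,
    # which is a reverse topological order of the DAG.
    children = {u: {c for c, _ in out} for u, out in dag.items()}
    closure = {}
    for u in TopologicalSorter(children).static_order():
        row = {}
        for child, label in dag.get(u, []):
            edge = frozenset([label])
            row.setdefault(child, set()).add(edge)           # the edge itself
            for dest, sets in closure[child].items():        # label joined with child's row
                row.setdefault(dest, set()).update(edge | s for s in sets)
        closure[u] = {dest: prune(sets) for dest, sets in row.items()}
    return closure

# Hypothetical toy DAG: 1 -a-> 2 -a-> 4 and 1 -b-> 3 -a-> 4.
toy_dag = {1: [(2, "a"), (3, "b")], 2: [(4, "a")], 3: [(4, "a")]}
toy_rows = dag_closure(toy_dag)
```

Each vertex's row is computed once from its children's finished rows, which is how the DAG ordering avoids repeating the single-source search from every vertex. In the toy DAG, the path through vertex 3 yields {a, b} for destination 4, which is pruned because {a} via vertex 2 covers it.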
First, we perform topological sorting over $D$. Initially, for all vertices $u$ in $G$, we set $M(u, -) = \{\emptyset, ..., \emptyset\}$. Then, we process each vertex according to the reverse topological order. If vertex $u$ has $n$ children $c_i$, for each $c_i$, we update $M_G(u, -) = Prune(M_G(u, -) \cup Prune(\lambda(\overrightarrow{uc_i}) \odot M_G(c_i, -)))$ iteratively. In this way, we obtain $M_G(u, -)$ for each vertex $u$ in DAG $D$.

Now, we need to consider the intra vertices in each cluster $C_i$, such as vertices and 7 in Figure 5. Given an intra vertex $u$ in $C_i$, we initialize $M_G(u, -) = M_{C_i}(u, -)$. Then, for each out-portal $u_i$ in $V_{i2}$, we update $M_G(u, -) = Prune(M_G(u, -) \cup Prune(M_{C_i}(u, u_i) \odot M_G(u_i, -)))$ iteratively. Consider any vertex $u' \notin C_i$. Given an intra vertex $u$ in $C_i$, we compute $M_G(u', u)$ as follows. Initially, we set $M_G(u', u) = \emptyset$. For each in-portal $u_i$ in $V_{i1}$, we update $M_G(u', u) = Prune(M_G(u', u) \cup Prune(M_G(u', u_i) \odot M_{C_i}(u_i, u)))$ iteratively.

As we know, the 2-hop labeling technique was proposed to compress the traditional transitive closure [1, 10]. We extend this labeling technique to compress the path-label transitive closure.

Definition 3.4. A 2-label-hop coding over graph $G$ assigns to each vertex $u$ ($\in V(G)$) a code $C(u) = (C_{in}(u), C_{out}(u))$, where the entries in $C_{in}(u)$ and $C_{out}(u)$ are of the form $\{(w, M_G(w, u))\}$ and $\{(w, M_G(u, w))\}$, respectively.

Definition 3.5. A 2-label-hop coding over graph $G$ is called complete if and only if Equation 2 holds.

$$\forall u_1, u_2 \in V(G),\ \forall S \subseteq \Sigma:\quad u_1 \xrightarrow{S} u_2 \Leftrightarrow \exists w,\ w \in C_{out}(u_1) \wedge w \in C_{in}(u_2) \wedge (\exists L(p_1),\ L(p_1) \in M_G(u_1, w) \wedge S \supseteq L(p_1)) \wedge (\exists L(p_2),\ L(p_2) \in M_G(w, u_2) \wedge S \supseteq L(p_2)) \qquad (2)$$

where $L(p_1)$ denotes one path label-set from $u_1$ to $w$, and $L(p_2)$ denotes one path label-set from $w$ to $u_2$.

4. EXPERIMENTS
In this section, we evaluate our methods over both random networks and real datasets, and compare them with the existing solution, the sampling-tree method in [1]. Specifically, we experimentally study the performance of three approaches:
1) the sampling-tree method proposed in [1];
2) the transitive closure method: we compute the path-label transitive closure by Algorithm 2 and compress it by the 2-label-hop technique. Then, based on the 2-label-hop codes, we can answer LCR queries;
3) the bi-directional BFS proposed in [1].
Our methods are implemented using C++, and our experiments are conducted on a P.0GHz machine with G RAM running Ubuntu Linux.

4.1 Datasets
Two types of synthetic datasets are used in our experiments: the Erdos Renyi Model (ER) and the Scale-Free Model (SF). ER is a classical random graph model. It defines a random graph as $|V|$ vertices connected by $|E|$ edges, chosen randomly from the $|V|(|V|-1)$ possible edges. In our experiments, we vary the density $|E|/|V|$ from 1. to .0, and vary $|V|$ from 1K to 10K. SF defines a random network with $|V|$ vertices satisfying a power-law distribution in vertex degrees. In our implementation, we use the graph generator gengraphwin (http://fabien.viger.free.fr/liafa/generation/) to generate a large graph $G$ satisfying a power-law distribution. Usually, the power-law distribution parameter $\gamma$ is between .0 and .0 to simulate real complex networks [1]. Thus, the default value of parameter $\gamma$ is set to . in this work. In order to study the scalability, we also vary $|V|$ in SF networks from 1K to 10K. The number of edge labels ($|\Sigma|$) is 0. Labels are assigned according to a uniform distribution.
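The ER datasets described above can be approximated by a small generator such as the sketch below. It is an assumption-laden illustration, not the generator used in the experiments: the name `er_labeled_graph`, the seeding, and the rejection sampling of edge pairs are all hypothetical choices. It draws distinct directed edges without self-loops from the |V|(|V|-1) possible ones and assigns each a label uniformly at random.

```python
import random

def er_labeled_graph(num_vertices, num_edges, labels, seed=0):
    """Return a list of (source, target, label) edges for a random directed graph."""
    max_edges = num_vertices * (num_vertices - 1)
    if num_edges > max_edges:
        raise ValueError("more edges requested than distinct ordered pairs exist")
    rng = random.Random(seed)
    pairs = set()
    # Rejection sampling: adequate for the sparse densities used in this kind of test.
    while len(pairs) < num_edges:
        u = rng.randrange(num_vertices)
        v = rng.randrange(num_vertices)
        if u != v:
            pairs.add((u, v))
    # Uniform label assignment over the given label alphabet.
    return [(u, v, rng.choice(labels)) for u, v in sorted(pairs)]

# Hypothetical small instance: 50 vertices, 120 edges, three labels.
toy_graph = er_labeled_graph(50, 120, ["a", "b", "c"], seed=7)
```

The resulting edge list uses the same (source, target, label) triples as a plain adjacency structure, so it can be fed directly to a reachability routine for small-scale experiments.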

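The transitive closure method compared in these experiments builds on two operations introduced before the experiments: pruning a collection of path-label sets down to its minimal members, and propagating those sets from children to parents in reverse topological order. The Python sketch below illustrates both on a small DAG and then answers an LCR query by testing whether some stored label set is contained in the constraint S. It is a simplified illustration under stated assumptions, not the implementation evaluated here: it ignores clusters and portals, omits the 2-label-hop compression, and all names (`prune`, `single_source_closures`, `lcr_reachable`) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def prune(label_sets):
    """Keep only minimal label sets: drop any set that strictly contains another."""
    sets = set(label_sets)
    return {s for s in sets if not any(t < s for t in sets)}

def single_source_closures(dag):
    """dag maps u -> [(child, edge_label), ...] and must be acyclic.

    Returns M with M[u][v] = minimal label sets over all paths from u to v.
    """
    # Listing children as "predecessors" makes static_order() emit children
    # before parents, i.e. a reverse topological order of the DAG.
    order = TopologicalSorter(
        {u: [c for c, _ in out] for u, out in dag.items()}
    ).static_order()
    M = {}
    for u in order:
        M[u] = {}
        for child, label in dag.get(u, []):
            edge = frozenset([label])
            # The edge itself, plus its label joined onto every path out of child.
            reached = {child: {edge}}
            for target, sets in M[child].items():
                reached.setdefault(target, set()).update(edge | s for s in sets)
            for target, sets in reached.items():
                M[u][target] = prune(M[u].get(target, set()) | sets)
    return M

def lcr_reachable(M, u, v, constraint):
    """True if some path from u to v uses only labels in `constraint`."""
    allowed = frozenset(constraint)
    return any(s <= allowed for s in M.get(u, {}).get(v, ()))

# Toy DAG: 1 -a-> 2 -b-> 4 and 1 -b-> 3 -b-> 4.
dag = {1: [(2, "a"), (3, "b")], 2: [(4, "b")], 3: [(4, "b")], 4: []}
M = single_source_closures(dag)
```

In the toy graph, vertex 4 is reachable from vertex 1 with label sets {a, b} and {b}; pruning keeps only {b}, because any constraint that admits {a, b} also admits {b}.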
Table 1: Performance vs. |V| in ER Graphs (density = 1.)
Columns: |V|; 2-Label-Hop Codes: IT (sec.), IS (KB), QT (ms.); Sampling-Tree method: IT (sec.), IS (KB), QT (ms.); Bi-directional BFS (Memory-based): QT (ms.)
1K 1 1 0.01 11 17 0.0 0.01
K 19 0.01 9 100 0.09 0.0
K 17 90 0.0 80 7890 0.10 0.0
K 9000 0.0 89 9890 0.1 0.0
8K 9 100 0.0 990 100890 0.18 0.0
10K 0 000 0.0 1000 10 0.9 0.0

Table 2: Performance vs. Graph Density in ER Graphs (|V| = 10K)
Columns: density; 2-Label-Hop Codes: IT (sec.), IS (MB), QT (ms.); Sampling-Tree method: IT (sec.), IS (MB), QT (ms.); Bi-directional BFS (Memory-based): QT (ms.)
80. 0.08 1890 9. 0.1 0.0
1111 10. 0.09 78.7 0. 0.09
7 10 0.1 F F F 0.1
19 18 0. F F F 0.18
Note: F denotes "fail due to memory crash" or "cannot finish index building in 8 hours".

We also employ two real graph datasets (Yeast and Small-Yago) in our experiments, which are provided by the authors of [12]. More details about the two datasets and more experiments are given in the full version of this paper [13].

Performance of Transitive Closure Method

In this section, we use our algorithm to compute the transitive closure for graph G. Then, we use the 2-label-hop coding technique to compress the transitive closure. We report index construction time (IT), index size (IS) and average query response time (QT) for the experiments on the synthetic datasets. Note that the default query constraint size (|S|) is 0% =. We also compare our method with the sampling-tree method. In the following experiments, we always randomly generate 10000 queries to evaluate query performance, and QT is reported as the average response time for one query. We evaluate the performance with regard to graph size, graph density and label constraint size |S|. We also test the performance of bi-directional BFS in Table 1. Since bi-directional BFS needs no offline processing, we only report its QT in the following experiments.

Exp1. Varying Graph Size (|V|) on ER graphs. In this experiment, we fix the density |E|/|V| = 1. and the label constraint size |S| at its default, and vary |V| from 1,000 to 10,000 to study how performance changes with graph size.
Table 1 reports the detailed performance: index size (IS), index building time (IT) and average query response time (QT). From Table 1, we see that the transitive closure method is faster than the sampling-tree method in offline processing by orders of magnitude. For example, when |V| = 1K, the transitive closure method spends only 1 seconds to build the index, but the sampling-tree method needs 11 seconds. The index size of our method is also much smaller than that of the sampling-tree method, and our query performance is better as well. Table 1 also shows that memory-based bi-directional BFS is very fast for LCR queries. However, this method is not scalable with regard to graph size due to its exponential time complexity.

Exp2. Varying Density (|E|/|V|) on ER graphs. In this experiment, we fix |V| = 10,000 and vary the density |E|/|V| over the four settings of Table 2 to study the performance of our method in dense graphs. From Table 2, we see that the index building time and the index size of both methods increase with the density. Furthermore, the sampling-tree method cannot finish index building in 8 hours at the two largest densities. Table 2 therefore shows that the transitive closure method scales better with the graph density |E|/|V| than the sampling-tree method. Both methods need to compute $M_G(u,-)$ (i.e., the single-source transitive closure). As proven in Theorem .1, our method has the minimal search space, whereas the search space of the sampling-tree method is not minimal. Thus, the larger search space limits the scalability of the sampling-tree method.

CONCLUSIONS

In this paper, we address label-constraint reachability (LCR) queries over large graphs. Theoretically, we propose several methods to optimize path-label transitive closure computing. We also demonstrate the superiority of our method by extensive experiments.

REFERENCES

[1] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, "Reachability and distance queries via 2-hop labels," SIAM J. Comput., vol. 32, no. 5, 2003.
[2] S. Trißl and U. Leser, "Fast and practical indexing and querying of very large graphs," in SIGMOD, 2007.
[3] R. Jin, Y. Xiang, N. Ruan, and H.
Wang, "Efficiently answering reachability queries on very large directed graphs," in SIGMOD, 2008, pp. 595-608.
[4] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry, "3-hop: a high-compression indexing scheme for reachability query," in SIGMOD, 2009.
[5] H. V. Jagadish, "A compression technique to materialize transitive closure," ACM Trans. Database Syst., vol. 15, no. 4, 1990.
[6] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu, "Dual labeling: Answering graph reachability queries in constant time," in ICDE, 2006.
[7] S. Lu, F. Zhang, J. Chen, and S.-H. Sze, "Finding pathway structures in protein interaction networks," Algorithmica, vol. 48, no. 4, 2007.
[8] J. P. McGlothlin and L. R. Khan, "RDFKB: efficient support for RDF inference queries and knowledge management," in IDEAS, 2009.
[9] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J.-R. Wen, "StatSnowball: a statistical approach to extracting entity relationships," in WWW, 2009.
[10] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu, "Fast computing reachability labelings for large graphs with high compression rate," in EDBT, 2008.
[11] R. Bramandia, B. Choi, and W. K. Ng, "Incremental maintenance of 2-hop labeling of large graphs," IEEE Trans. Knowl. Data Eng., vol. 22, no. 5, 2010.
[12] R. Jin, H. Hong, H. Wang, N. Ruan, and Y. Xiang, "Computing label-constraint reachability in graph databases," in SIGMOD, 2010.
[13] L. Zou, K. Xu, J. X. Yu, L. Chen, Y. Xiao, and D. Zhao, "Answering label-constraint reachability in large graphs," Peking University, Tech. Rep., 2011. [Online]. Available: http://www.icst.pku.edu.cn/intro/leizou/tr/labelconstraintquery.pdf
[14] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu, "Adding regular expressions to graph reachability and pattern queries," in ICDE, 2011.
[15] Réka Albert and Albert-László Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, vol. 74, pp. 47-97, 2002.