TF-Label: a Topological-Folding Labeling Scheme for Reachability Querying in a Large Graph


James Cheng, Silu Huang, Huanhuan Wu, Ada Wai-Chee Fu
Department of Computer Science and Engineering, The Chinese University of Hong Kong
{jcheng, slhuang, hhwu, adafu}@cse.cuhk.edu.hk

ABSTRACT

Reachability querying is a basic graph operation with numerous important applications in databases, network analysis, computational biology, software engineering, etc. Although many indexes have been proposed to answer reachability queries, most of them are only efficient for handling relatively small graphs. We propose TF-label, an efficient and scalable labeling scheme for processing reachability queries. TF-label is constructed based on a novel topological folding (TF) that recursively folds an input graph in half so as to reduce the label size, thus improving query efficiency. We show that TF-label is efficient to construct and propose efficient algorithms and optimization schemes. Our experiments verify that TF-label is significantly more scalable and efficient than the state-of-the-art methods in both index construction and query processing.

Categories and Subject Descriptors: E.1 [DATA]: DATA STRUCTURES - Graphs and networks

General Terms: Algorithms, Performance

Keywords: Graph reachability, graph indexing, graph querying

1. INTRODUCTION

A reachability query asks whether there exists a path from one vertex to another vertex in a directed graph. Reachability querying is one of the fundamental operations in directed graphs. It has a wide range of applications such as processing recursive queries in data and knowledge base management, querying associations and logical reasoning in Web and Semantic Web graphs, pattern matching in graphs and XML documents, analyzing the biological function of genes, checking connections in geographic navigation systems, social network analysis, ontology querying, program analysis, and many more.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'13, June 22-27, 2013, New York, New York, USA. Copyright 2013 ACM /13/06...$

Reachability querying has been extensively studied in the past [1, 2, 3, 4, 5, 6, 10, 11, 12, 14, 17, 18, 19, 20, 21, 23, 24, 25, 28]. In recent years, there has been a shift of interest toward handling large graphs. The more recent works [6, 18, 19, 25, 28] have highlighted the applications of reachability querying in large graphs such as Web graphs, Semantic Web and RDF graphs, social networks, large XML databases, etc., and more effort has been devoted to the development of scalable methods for answering reachability queries.

As pointed out in [18], most existing methods can only handle relatively small graphs with tens to hundreds of thousands of vertices and edges. For processing larger graphs, these methods are either too costly in indexing or in query processing (more discussion in Section 9), thus limiting their application to real-world graphs. For graphs with millions of vertices and edges, only a few methods can process them with reasonably good efficiency [19, 25, 28]. For larger graphs with tens of millions of vertices and edges, the only known method that attains reasonable indexing and querying efficiency is the recently proposed backbone structure [18]. A reachability query, i.e., whether a vertex $s$ can reach another vertex $t$, can be answered by (1) first finding all backbone vertices $B_s$ that can be reached from $s$ and all backbone vertices $B_t$ that can reach $t$, and then (2) checking whether any vertex in $B_s$ can reach any vertex in $B_t$.
Any existing method can be applied to the backbone graph to process Step (2), and querying is generally faster since the backbone can be significantly smaller than the original graph.

Although the backbone is used as a general framework (called SCARAB [18]) to further improve the scalability of a reachability index (including ours), an efficient and scalable method itself is still most crucial for query performance, for two main reasons (both verified in our experiments). First, SCARAB itself may not be scalable to large graphs. Second, the backbone of a large graph may still be too large for existing methods.

We propose an efficient and scalable labeling scheme, which can process large graphs that cannot be handled by SCARAB and other existing methods. Given the labels of $s$ and $t$, i.e., a set of vertices that are reachable from $s$ and a set of vertices that can reach $t$, respectively, we can answer whether $s$ can reach $t$ efficiently by simply intersecting their labels (as in [14]). We give the main idea of our method as follows.

We propose a novel data structure, called topological folding (TF), based on which we develop our labeling scheme, TF-label. Given a directed graph, we can convert it into a directed acyclic graph (DAG) by condensing each strongly connected component (SCC) in the graph into a super node. Reachability queries can be answered on the DAG since all vertices are reachable from each other within an SCC. We define a topological structure $T$ for the DAG. TF is intuitively a structure obtained by folding $T$ in half each time, which essentially implies a great reduction in the label size, as labeling is processed in $O(\lg l)$ levels instead of a total of $l$ levels in $T$. Then, we apply a labeling technique, inspired by the work of [16], on the TF structure to construct labels for answering reachability queries.

We summarize the main contributions of our work as follows.

- We propose an efficient and scalable TF-based labeling scheme for reachability query processing.
- We propose optimization techniques, such as special handling of high-degree vertices, to further improve the scalability.
- We propose efficient algorithms for constructing the TF structure and then the labels from the TF, as well as the optimization techniques.
- Our experiments on a wide spectrum of real and synthetic datasets verify that TF-label achieves competitive indexing performance and significantly better query performance than the state-of-the-art methods [18, 19, 25, 28]. In many cases, TF-label is one to several orders of magnitude faster in query processing. We also show that TF-label is more scalable and has stable performance as various graph properties change.

The rest of the paper is organized as follows. We first give some basic notations and the problem definition in Section 2. Then, in Sections 3 to 7, we present the details of TF and TF-label with their design and algorithms. We evaluate the performance of TF-label in Section 8. Finally, we discuss related work in Section 9 and conclude the paper in Section 10.

2. NOTATIONS/PROBLEM DEFINITION

Given a directed graph $\mathcal{G}$, a reachability query asks whether there is a path from a vertex $u$ to another vertex $v$ in $\mathcal{G}$. We assume $u \neq v$, as it is trivial to process $u = v$.

Formally, a directed edge, or simply an edge (since all edges are directed in this paper), from $u$ to $v$ is denoted by $(u, v)$. A path $P$ from $v_1$ to $v_p$ in $\mathcal{G}$ is defined by $P = \langle v_1, \ldots, v_p \rangle$ such that $(v_i, v_{i+1})$ is an edge in $\mathcal{G}$ for $1 \le i < p$. We use $u \to v$ to indicate that $u$ can reach $v$ (or $v$ is reachable from $u$), and $u \nrightarrow v$ to indicate that $u$ cannot reach $v$.

Given any two vertices $u$ and $v$ in a strongly connected component (SCC) of $\mathcal{G}$, $u$ can always reach $v$.
With this observation, existing methods first compute a compressed graph, $G = (V_G, E_G)$, of $\mathcal{G}$ as follows: the set of vertices $V_G$ of $G$ is the set of SCCs of $\mathcal{G}$, and a directed edge is created in $G$ from one SCC $C_1$ to another SCC $C_2$ if there exists a directed edge $(v_1, v_2)$ in $\mathcal{G}$, where $v_1$ is a vertex in $C_1$ and $v_2$ is a vertex in $C_2$. Then, a reachability query is answered by checking whether there is a path from $C_u$ to $C_v$ in $G$, where $C_u, C_v \in V_G$, $u$ is a vertex in $C_u$ and $v$ is a vertex in $C_v$.

The compressed graph $G$ created above is in fact a directed acyclic graph (DAG). Thus, for simplicity, we call $G$ the DAG of $\mathcal{G}$ in this paper. Since the SCCs of $\mathcal{G}$ can be computed efficiently [15], we follow the convention of existing methods and assume that the input to our algorithm is the DAG of the input directed graph.

Given a DAG, $G = (V_G, E_G)$, we define the set of in-neighbors (out-neighbors) of a vertex $v \in V_G$ as $nb_{in}(v, G) = \{u : (u, v) \in E_G\}$ ($nb_{out}(v, G) = \{u : (v, u) \in E_G\}$), and the in-degree (out-degree) of $v$ as $deg_{in}(v, G) = |nb_{in}(v, G)|$ ($deg_{out}(v, G) = |nb_{out}(v, G)|$).

Problem definition. We study the following problem: given a DAG $G = (V_G, E_G)$, compute a set of vertex labels (also called an index) for processing reachability queries, i.e., given $s, t \in V_G$, the query whether $s$ can reach $t$ can be efficiently answered using the labels of $s$ and $t$.

3. TOPOLOGICAL FOLDING

In Sections 3 to 6, we present our main indexing scheme, called TF-label, which is designed based on a novel topological folding scheme of the DAG of a directed graph. We first present the concept of topological folding in this section.

3.1 Basic Topological Folding

Given a DAG $G = (V_G, E_G)$, we start by assigning each vertex in $G$ a topological level number as follows.

DEFINITION 1 (TOPOLOGICAL LEVEL NUMBER). Given a DAG $G = (V_G, E_G)$, the topological level number of a vertex $v \in V_G$, denoted by $l(v, G)$, is defined as follows:
If $nb_{in}(v, G) = \emptyset$: $l(v, G) = 1$;
Else: $l(v, G) = \max\{l(u, G) + 1 : u \in nb_{in}(v, G)\}$.
The topological level number of $G$, denoted by $l(G)$, is given by $l(G) = \max\{l(v, G) : v \in V_G\}$.
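As a minimal sketch of Definition 1, the level numbers can be computed in one pass over a topological order. The function name `topological_levels` and the four-vertex DAG are illustrative assumptions, not taken from the paper, and the traversal (Kahn's algorithm) is one convenient choice rather than the authors' implementation.

```python
from collections import deque

def topological_levels(vertices, edges):
    """Return {v: l(v, G)} for a DAG given as a vertex list and edge list.

    Sources get level 1; every other vertex gets 1 + the maximum level
    of its in-neighbors (Definition 1).
    """
    out = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1

    level = {v: 1 for v in vertices}
    queue = deque(v for v in vertices if indeg[v] == 0)
    while queue:  # Kahn's algorithm: visit vertices in topological order
        u = queue.popleft()
        for v in out[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level

# Hypothetical DAG (not from the paper): a -> b -> c, a -> c, d -> c.
levels = topological_levels(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("a", "c"), ("d", "c")],
)
```

Here `l(G)` is simply `max(levels.values())`, and each topological level is the set of vertices sharing one value.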
Since $G$ is a DAG, it is easy to see that every vertex $v \in V_G$ has exactly one topological level number, which can be derived from a topological ordering of the DAG. Given the topological level number, we now define the topological levels of a DAG and state an important property that will be used in the definition of topological folding later on.

DEFINITION 2 (TOPOLOGICAL LEVELS). A DAG $G = (V_G, E_G)$ consists of $t$ topological levels of vertices, denoted by $\{L_1(G), \ldots, L_t(G)\}$, where $t = l(G)$, and $L_i(G) = \{v : v \in V_G, l(v, G) = i\}$ for $1 \le i \le t$.

LEMMA 1. Each topological level $L_i(G)$ of a DAG $G$, for $1 \le i \le l(G)$, is an independent set of $G$.

PROOF. $L_i(G)$ is an independent set of $G$ if $\forall u, v \in L_i(G)$, $(u, v) \notin E_G$ and $(v, u) \notin E_G$. Suppose to the contrary that $(u, v) \in E_G$ or $(v, u) \in E_G$; then we have either $l(u, G) < l(v, G)$ or $l(v, G) < l(u, G)$, contradicting the fact that $u, v \in L_i(G)$, i.e., $l(u, G) = l(v, G) = i$.

To clearly illustrate the concepts, for now let us assume that the DAG $G$ only has edges going from vertices in $L_i(G)$ to vertices in $L_{i+1}(G)$, and there is no edge going from any vertex in $L_i(G)$ to a vertex in $L_j(G)$ where $j > i + 1$ (we will handle such edges in Section 3.2). We call such a DAG a $k$-partite DAG, where $k = l(G)$. Figure 1(a) shows an example of a $k$-partite DAG where $k = 6$.

We define a topological folding scheme that recursively folds up $G$ by taking away half of the levels, as follows.

DEFINITION 3 (TOPOLOGICAL FOLDING (TF)). Given a $l(G)$-partite DAG $G = (V_G, E_G)$, the topological folding (TF) of $G$ is a set of DAGs, $\mathbb{G} = \{G_1, G_2, \ldots, G_f\}$, where each $G_i = (V_{G_i}, E_{G_i})$ is defined as follows:
$V_{G_1} = V_G$ and for $2 \le i \le f$, $V_{G_i} = \bigcup_{1 \le j \le \lfloor l(G_{i-1})/2 \rfloor} L_{2j}(G_{i-1})$;
For $1 \le i \le f$, $E_{G_i}$ is a set of edges with which $G_i$ is a $l(G_i)$-partite DAG and $\forall u, v \in V_{G_i}$, $u \to v$ in $G_i$ if and only if $u \to v$ in $G$.
The topological folding number, or TF number, of $G$, denoted by $tf(G)$, is given by $tf(G) = f = |\mathbb{G}| = \lfloor \log_2 l(G) \rfloor + 1$.
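The closed form for $tf(G)$ in Definition 3 can be checked against the folding process itself, since each fold keeps only the even levels and so leaves $\lfloor l/2 \rfloor$ of $l$ levels. The sketch below is illustrative only; the function names are assumptions, and `int.bit_length()` is used because it equals $\lfloor \log_2 n \rfloor + 1$ for any positive integer $n$.

```python
def tf_number(num_levels):
    """floor(log2(l(G))) + 1, computed exactly with integer arithmetic."""
    return num_levels.bit_length()

def fold_level_counts(num_levels):
    """Number of levels of G_1, G_2, ..., G_f when each fold keeps the even levels."""
    counts = [num_levels]
    while counts[-1] > 1:
        counts.append(counts[-1] // 2)
    return counts

counts = fold_level_counts(6)   # the 6-partite DAG used as the running example
```

For the 6-level example, `counts` is `[6, 3, 1]`, so there are three folding graphs, in agreement with `tf_number(6)`.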

Intuitively, TF folds each $G_{i-1}$ in half (i.e., taking away half of the levels together with their vertices) to obtain $G_i$, starting from $G_1 = G$ and ending at $G_f$, which has only one level and cannot be folded any more. Hence, we have the name topological folding.

To correctly process reachability queries, it is necessary for the edge sets $E_{G_i}$ to maintain the reachability of the vertices. To efficiently process reachability queries, we also want each $E_{G_i}$ to be as small as possible. The following lemma leads to a simple and efficient method to construct $E_{G_i}$.

LEMMA 2. Let $G = (V_G, E_G)$ be a $l(G)$-partite DAG and $\mathbb{G} = \{G_1, G_2, \ldots, G_{tf(G)}\}$ be a topological folding of $G$. For $2 \le i \le tf(G)$, $V_{G_{i-1}} \setminus V_{G_i}$ is an independent set of $G_{i-1}$.

PROOF. According to Lemma 1, each $L_j(G_{i-1})$ for $1 \le j \le l(G_{i-1})$ is an independent set of $G_{i-1}$. According to the definition of $\mathbb{G}$, $V_{G_1} = V_G$ and for $2 \le i \le tf(G)$, $V_{G_{i-1}} \setminus V_{G_i}$ are the vertices at all the odd levels of $G_{i-1}$. Since each $G_{i-1}$ is a $l(G_{i-1})$-partite DAG, the union of the vertices at all the odd levels of $G_{i-1}$ is clearly an independent set of $G_{i-1}$.

We construct the edge sets $E_{G_i}$ as follows.
$E_{G_1} = E_G$;
For $2 \le i \le tf(G)$, $E_{G_i}$ is constructed from $G_{i-1}$ as follows: for each $v \in L_j(G_{i-1})$, where $j$ is odd, create a new edge in $E_{G_i}$ from each in-neighbor (if any in $G_{i-1}$) of $v$ to each out-neighbor (if any in $G_{i-1}$) of $v$.

LEMMA 3. The edge sets $E_{G_i}$ constructed above give a valid topological folding $\mathbb{G}$ of a $l(G)$-partite DAG $G = (V_G, E_G)$.

PROOF. First, each $G_i$ is a $l(G_i)$-partite DAG since each edge in $E_{G_i}$ only goes from $L_j(G_i)$ to $L_{j+1}(G_i)$, for $1 \le j < l(G_i)$. Second, reachability from each vertex to another is maintained because each $u_{in} \in L_{j-1}(G_{i-1})$ is connected to each $u_{out} \in L_{j+1}(G_{i-1})$ by an edge in $E_{G_i}$ if the edges $(u_{in}, v)$ and $(v, u_{out})$ exist in $G_{i-1}$, where $v \in L_j(G_{i-1})$ and $j$ is odd.

Note that the correctness of the proof of Lemma 3 also depends on the validity of Lemma 2, because if any edge $(u, v)$, where $u, v \in V_{G_{i-1}} \setminus V_{G_i}$, exists in $G_{i-1}$, then the reachability established in the proof of Lemma 3 will not be valid.

The following example illustrates the idea of topological folding.

EXAMPLE 1.
Figure 1 shows the topological folding of a 6-partite DAG $G$ ($l(G) = 6$). $G_2$ is constructed from $G_1$ by adding edges $(c, f)$, $(d, f)$, and $(f, h)$, and then removing all vertices in the odd levels of $G_1$. Next, odd-level vertices of $G_2$ are removed to form $G_3$.

Figure 1: Topological folding

3.2 Dealing with Cross-Level Edges

In Section 3.1 we introduced the basic concepts and structure of topological folding of a DAG and some of its essential properties. However, the DAG $G$ of a real-world directed graph is rarely $l(G)$-partite. On the contrary, there can be many cross-level edges in $G$, i.e., there can be edges from vertices in $L_i(G)$ to vertices in $L_j(G)$, where $1 \le i < i + 1 < j \le l(G)$, as shown in Figure 2.

To deal with these cross-level edges in the DAG, we observe that each DAG $G_i$ in a topological folding $\mathbb{G}$ need not be $l(G_i)$-partite, but only needs the following essential properties to be maintained in each $G_i$: (1) the set of vertices to be removed from $G_{i-1}$ is an independent set of $G_{i-1}$ for $2 \le i \le tf(G)$; and (2) $\forall u, v \in (V_G \cap V_{G_i})$, $u \to v$ in $G$ if and only if $u \to v$ in $G_i$.

To construct each $G_i$ that satisfies the above two properties, we devise a transformation scheme for $G_{i-1}$, for $2 \le i \le tf(G)$, with which we construct the corresponding transformed topological folding as follows:

Procedure 1. TRANSFORMED TF CONSTRUCTION:
1. $G_1 = G$, and set $i = 1$;
2. Initialize $G_i^* = G_i$, then do the following three steps in order:
2.1. For $1 \le j \le l(G_i^*)$ and $j$ is odd, for each $v \in L_j(G_i^*)$: Let $U = (L_k(G_i^*) \cap nb_{out}(v, G_i^*))$, where $k > j + 1$. If $U \neq \emptyset$, then add a dummy vertex $w$ to $L_{j+1}(G_i^*)$, add a new edge set $\{(w, u_{out}) : u_{out} \in U\}$ and a new edge $(v, w)$ to $E_{G_i^*}$, and remove the edge set $\{(v, u_{out}) : u_{out} \in U\}$ from $E_{G_i^*}$.
2.2. For $1 \le j \le l(G_i^*)$ and $j$ is odd, for each $v \in L_j(G_i^*)$: Let $U = (L_k(G_i^*) \cap nb_{in}(v, G_i^*))$, where $k < j - 1$ and $k$ is even. If $U \neq \emptyset$, then add a dummy vertex $w$ to $L_{j-1}(G_i^*)$, add a new edge set $\{(u_{in}, w) : u_{in} \in U\}$ and a new edge $(w, v)$ to $E_{G_i^*}$, and remove the edge set $\{(u_{in}, v) : u_{in} \in U\}$ from $E_{G_i^*}$.
2.3. For $1 \le j \le l(G_i^*)$ and $j$ is odd, for each $v \in L_j(G_i^*)$: add a new edge set $\{(u_{in}, u_{out}) : u_{in} \in (L_{j-1}(G_i^*) \cap nb_{in}(v, G_i^*)), u_{out} \in (L_{j+1}(G_i^*) \cap nb_{out}(v, G_i^*))\}$ to $E_{G_i^*}$.
3. If $l(G_i^*) > 1$, initialize $G_{i+1} = G_i^*$, and remove all vertices at odd levels of $G_{i+1}$ together with all edges incident to them; then, set $i = i + 1$ and go to Step 2. Otherwise, return the transformed topological folding $\mathbb{G}^* = \{G_1^*, \ldots, G_{tf(G)}^*\}$ and quit.

Note that Step 2.2 ignores all Level-$k$ in-neighbors of $v$ if $k$ is odd, because for this case a dummy vertex must have been created at an even level in Step 2.1, and is thus also handled in Step 2.2. Also note that we do not increase the number of levels in any $G_i$ or $G_i^*$, and hence $tf(G)$ is still defined in the same way as in Definition 3. We also define the TF number of a vertex as follows.

DEFINITION 4 (TOPOLOGICAL FOLDING NUMBER). Let $G = (V_G, E_G)$ be a DAG, $\mathbb{G}^* = \{G_1^*, \ldots, G_{tf(G)}^*\}$ be the transformed topological folding of $G$, and let $V'$ be the set of dummy vertices created in $\mathbb{G}^*$. The TF number of a vertex $v \in (V_G \cup V')$, denoted by $tf(v)$, is given by $tf(v) = \max\{i : v \in V_{G_i^*}\}$.

The TF number of $G$ is given by $tf(G) = |\mathbb{G}^*| = \lfloor \log_2 l(G) \rfloor + 1$. Also note that $tf(G) = \max\{tf(v) : v \in V_G\}$.

We illustrate the concept using the following example.

Figure 2: Transformed topological folding (panels: $G = G_1$ and $G_1^*$; $G_2$ and $G_2^*$; $G_3$ and $G_3^*$)

EXAMPLE 2. Figure 2 shows the transformed topological folding of a DAG. The DAG $G$ in Figure 2(a) contains a number of cross-level edges: $(a, h)$, $(b, f)$, $(d, f)$, $(e, g)$. By Procedure 1, we first transform $G = G_1$ to $G_1^*$. At level 1, Step 2.1 is executed: we add dummy vertex $a_1$ for $a$, and add edges $(a, a_1)$ and $(a_1, h)$, then edge $(a, h)$ is removed; similarly, we add $b_1$, $(b, b_1)$ and $(b_1, f)$, and remove $(b, f)$. Next consider level 3: $e_1$ is added for $e$, and we add $(e, e_1)$, $(e_1, g)$, and remove $(e, g)$. At Step 2.3, we add $(c, e_1)$ and $(c, f)$. Finally, for level 5, at Step 2.3, we add $(e_1, h)$ and $(f, h)$. Thus, we have constructed $G_1^*$, i.e., the figure on the right in Figure 2(a). Note that in $G_1^*$, the vertices at all the odd levels are independent of each other. At Step 3 these vertices are removed, and we obtain $G_2$, as shown in Figure 2(b). Repeating the process, we obtain $G_2^*$ and $G_3$, while $G_3^*$ is simply the same as $G_3$.

By Definition 4, $tf(v) = 1$ for $v \in \{a, b, e, g\}$ since their last occurrence is in $G_1^*$. Similarly, $tf(v) = 2$ for $v \in \{a_1, c, d, b_1, h\}$, $tf(v) = 3$ for $v \in \{a_2, e_1, f\}$, and $tf(G) = 3$.

One concern in the process of Procedure 1 is that many dummy vertices and edges may be created. We will handle these cases in Sections 5 and 6. In fact, $G_i$ (or $G_i^*$) is also not useful for reachability processing and is hence deleted after the labeling process.

The following lemma is important in establishing the correctness of our method for reachability query answering in Section 4.1.

LEMMA 4. Let $\mathbb{G}^* = \{G_1^*, \ldots, G_{tf(G)}^*\}$ be the transformed topological folding of a DAG $G = (V_G, E_G)$. Let $G_i$ be the graph from which $G_i^*$ is transformed. Then,
(1) $V_{G_{i-1}^*} \setminus V_{G_i}$ is an independent set of $G_{i-1}^*$ for $2 \le i \le tf(G)$; and
(2) $\forall u, v \in (V_{G_i} \cap V_{G_i^*})$, where $1 \le i \le tf(G)$, $u \to v$ in $G_i$ if and only if $u \to v$ in $G_i^*$; and
(3) $\forall u, v \in (V_{G_i} \cap V_{G_j})$, where $j = i - 1$ and $1 < i \le tf(G)$, $u \to v$ in $G_i$ if and only if $u \to v$ in $G_j$.

PROOF. We first prove (1). According to Procedure 1, we obtain $G_i$ by removing the odd levels of $G_{i-1}^*$, i.e., $V_{G_{i-1}^*} \setminus V_{G_i}$. Since there is no edge from a vertex to another vertex at the same level in $G_{i-1}^*$, each level of $G_{i-1}^*$ is an independent set of $G_{i-1}^*$. For any edge that goes from $u$ at an odd level to $v$ at another odd level, the edge is removed from $G_{i-1}^*$ and a dummy vertex is created to preserve the connection from $u$ to $v$. Thus, for any $u, v \in V_{G_{i-1}^*} \setminus V_{G_i}$, $(u, v)$ does not exist in $G_{i-1}^*$.

Next we prove (2). From $G_i$ to $G_i^*$, Procedure 1 either converts a cross-level edge to a path with a middle dummy vertex or adds an edge from an in-neighbor to an out-neighbor of an odd-level vertex in $G_i^*$. Thus, in both cases, (2) is true.

Lastly, we prove (3). According to Procedure 1, all the cross-level edges in $G_j$ are removed from $G_j^*$, and hence a vertex $w$ at $L_k$ of $G_j^*$, where $1 \le k \le l(G_j^*)$ and $k$ is odd, has only in-neighbors at $L_{k-1}$ (if any) and out-neighbors at $L_{k+1}$ (if any). Since Procedure 1 creates an edge from every in-neighbor of $w$ to every out-neighbor of $w$, we have $u \to v$ in $G_i$ if and only if $u \to v$ in $G_j^*$ for any $u, v \in (V_{G_i} \cap V_{G_j^*})$, which together with (2) implies (3).

Note that by a recursive analysis on (3) of Lemma 4, we can actually prove a stronger lemma that shows $u \to v$ in $G_i$ if and only if $u \to v$ in $G_j$, for all $u, v \in (V_{G_i} \cap V_{G_j})$, where $1 \le j < i \le tf(G)$ (instead of $j = i - 1$ as in (3) of Lemma 4).

4. LABELING AND QUERY ANSWERING

In this section, we present our TF-based labeling scheme and discuss reachability query answering using the labels.

4.1 The Labeling Scheme

The label of a vertex is defined as follows.

DEFINITION 5 (VERTEX LABEL). Let $G = (V_G, E_G)$ be a DAG, $\mathbb{G}^* = \{G_1^*, \ldots, G_{tf(G)}^*\}$ be the transformed topological folding of $G$, and let $V'$ be the set of dummy vertices created in $\mathbb{G}^*$. The in-label and out-label of a vertex $v \in (V_G \cup V')$, denoted by $label_{in}(v)$ and $label_{out}(v)$, are defined as follows:
$label_{in}(v)$: (1) $v \in label_{in}(v)$, and (2) for any $u \in label_{in}(v)$, $nb_{in}(u, G^*_{tf(u)}) \subseteq label_{in}(v)$.
$label_{out}(v)$: (1) $v \in label_{out}(v)$, and (2) for any $u \in label_{out}(v)$, $nb_{out}(u, G^*_{tf(u)}) \subseteq label_{out}(v)$.

Intuitively, we add to $label_{in}(v)$ and $label_{out}(v)$ recursively the in-neighbors and out-neighbors in the folding graph $G_i^*$ of each vertex $u$ currently in $label_{in}(v)$ and $label_{out}(v)$, where $i = tf(u)$.

The following property between a vertex and its in-neighbors/out-neighbors shows that, in constructing the labels for a vertex, we only go for reachable vertices with a higher TF number and ignore all other reachable vertices. This is a crucial design principle of our labeling scheme that leads to a significant reduction in the label size (compared with transitive closure), since each vertex has $O(l(G))$ levels of reachable vertices, but only $O(\lg l(G))$ levels of reachable vertices with a higher TF number.

LEMMA 5. If $w \in nb_{in}(u, G^*_{tf(u)})$ or $w \in nb_{out}(u, G^*_{tf(u)})$, then $tf(w) > tf(u)$.

PROOF. Since $w$ is in $G^*_{tf(u)}$, we have $tf(w) \ge tf(u)$. However, $tf(w) = tf(u)$ implies that both $w$ and $u$ are in an independent set of $G^*_{tf(u)}$, which contradicts the fact that the edge $(u, w)$ or $(w, u)$ exists in $G^*_{tf(u)}$. Thus, $tf(w) \neq tf(u)$ and $tf(w) > tf(u)$.

We use the following example to illustrate the labeling scheme.

EXAMPLE 3. Consider the labeling for vertex $a$. Initially, $a$ is added to $label_{in}(a)$ and $label_{out}(a)$. Since $tf(a) = 1$ and $nb_{in}(a, G_1^*) = \emptyset$, we finalize $label_{in}(a) = \{a\}$. Next, since $nb_{out}(a, G_1^*) = \{a_1, c, d\}$, $\{a_1, c, d\}$ are added to $label_{out}(a)$. Since $a_1$ has an out-neighbor $a_2$ in $G^*_{tf(a_1)} = G_2^*$, we add $a_2$ to $label_{out}(a)$. We also add $\{e_1, f\}$ to $label_{out}(a)$ for $nb_{out}(c, G_2^*) = \{e_1, f\}$ and $nb_{out}(d, G_2^*) = \{f\}$. The vertices $\{a_2, e_1, f\}$ have a TF number of 3 but they have no out-neighbor in $G_3^*$, and hence the labeling for $a$ is completed. The labels for all vertices are shown in Table 1.

4.2 Reachability Querying using Labels

We now discuss how we use the vertex labels to process reachability queries. Given two vertices $s$ and $t$ in $G$, we ask whether $s$ can reach $t$; the query answer is given by the following equation.

$$s \to t = \begin{cases} \text{true}, & \text{if } label_{out}(s) \cap label_{in}(t) \neq \emptyset; \\ \text{false}, & \text{if } label_{out}(s) \cap label_{in}(t) = \emptyset. \end{cases} \qquad (1)$$

Table 1: Labeling for the example in Figure 2

vertex | label_out | label_in
a | {a, a_1, c, d, e_1, f} | {a}
b | {b, b_1, d, f} | {b}
e | {e, e_1, f} | {c, e}
g | {g, h} | {e_1, f, g}
a_1 | {a_1, a_2} | {a_1}
c | {c, e_1, f} | {c}
d | {d, f} | {d}
b_1 | {b_1, f} | {b_1}
h | {h} | {a_2, e_1, f, h}
a_2 | {a_2} | {a_2}
e_1 | {e_1} | {e_1}
f | {f} | {f}

We give an example of reachability query processing as follows.

EXAMPLE 4. Consider the example in Figure 2; the labeling is shown in Table 1. Suppose the query is to ask whether $c$ can reach $h$: since $label_{out}(c) \cap label_{in}(h) = \{e_1, f\}$, the answer is true. Now consider whether $a$ can reach $b$: since $label_{out}(a) \cap label_{in}(b) = \emptyset$, the answer is false.

Lemmas 6-9 and Theorem 1 establish the correctness of reachability query answering by Equation (1). The lemmas themselves also reveal important properties and the design of the TF structure, and hence how TF labeling works for reachability query answering.

LEMMA 6. Given a path $P = \langle u_1, \ldots, u_\alpha \rangle$ in any graph in $\mathbb{G}^*$, there exists a sequence of vertices $S = \langle u_1 = v_1, \ldots, v_\beta = u_\alpha \rangle$ such that for $1 \le i < \beta$: (1) the edge $(v_i, v_{i+1})$ is in $G_j^*$ where $j = \min(tf(v_i), tf(v_{i+1}))$; and (2) the sequence $S$ is maximal, i.e., no sub-sequence can be inserted between any $v_i$ and $v_{i+1}$ such that the resultant sequence also satisfies (1).

PROOF. The path $P$ implies that there exists a sequence $S = \langle u_1, S_1, u_2, S_2, \ldots, u_{\alpha-1}, S_{\alpha-1}, u_\alpha \rangle$, where each $S_i$ for $1 \le i < \alpha$ is constructed (according to Procedure 1) as follows.

If $l(u_{i+1}, G_j^*) = l(u_i, G_j^*) + 1$, where $j = \min(tf(u_i), tf(u_{i+1}))$, then either $u_i$ or $u_{i+1}$ will be removed in $G_{j+1}$ and hence $S_i$ must be an empty set. In this case, we have $(u_i, u_{i+1})$ in $G_j^*$.

Otherwise, $(u_i, u_{i+1})$ is a cross-level edge in $G_j$, where $j = \min(tf(u_i), tf(u_{i+1}))$, and then $S_i$ is a sequence of dummy vertices. Assume $j = tf(u_i)$ (the case $j = tf(u_{i+1})$ can be processed similarly). To preserve the reachability from $u_i$ to $u_{i+1}$ in $G_j^*$, at least one dummy vertex $w$ must be created in $G_j^*$ together with the edges $(u_i, w)$ and $(w, u_{i+1})$. Thus, we have the edge $(u_i, w)$ in $G_j^*$.
If $(w, u_{i+1})$ is still a cross-level edge in $G_j$, where $j = \min(tf(w), tf(u_{i+1}))$, then another dummy vertex is to be created in $G_j^*$ to preserve the reachability from $w$ to $u_{i+1}$ in $G_j^*$. A recursive expansion in this way gives the subsequence $S_i' = \langle u_i = w_1, w_2, \ldots, w_{\gamma-1}, w_\gamma = u_{i+1} \rangle$, where $S_i = \langle w_2, \ldots, w_{\gamma-1} \rangle$, and for $1 \le k < \gamma$, $(w_k, w_{k+1})$ is in $G_j^*$ with $j = \min(tf(w_k), tf(w_{k+1}))$.

$S$ is ensured to be maximal if the above recursive expansion is executed until no more sub-sequences can be generated. By relabeling the vertices, we obtain $S = \langle u_1 = v_1, \ldots, v_\beta = u_\alpha \rangle$ such that $S$ satisfies both (1) and (2).

Lemma 6 is used to show that a sequence of vertices $S$ with a special property (as specified in the lemma) exists for a path $P$ in any graph in $\mathbb{G}^*$. The existence of such a sequence is essential in proving the correctness of Lemma 9 and hence Theorem 1.

LEMMA 7. Given a sequence of vertices $S = \langle s = v_1, \ldots, v_\beta = t \rangle$, where for $1 \le i < \beta$, the edge $(v_i, v_{i+1})$ is in $G_j^*$ where $j = \min(tf(v_i), tf(v_{i+1}))$: if $s$ and $t$ are both in some graph $G_\varphi^* \in \mathbb{G}^*$, then $s \to t$ in $G_\varphi^*$.

PROOF. First, each edge $(v_i, v_{i+1})$ in $G_j^*$ implies $v_i \to v_{i+1}$ in $G_j^*$. We can derive the reachability from $v_1$ to $v_\beta$ in $G_\varphi^*$ as follows.

Consider the vertex $v_i \in S$ where $tf(v_i) < \varphi$ and $tf(v_i) \le tf(v)$ for all $v \in S \setminus \{v_i\}$. If $v_i$ exists in $S$, then according to Procedure 1, $v_{i-1}$ must be connected to $v_{i+1}$ in $G^*_{tf(v_i)}$ in order to preserve the reachability from $v_{i-1}$ to $v_{i+1}$ via $v_i$. Thus, removing $v_i$ from $S$ we still have $v_{i-1} \to v_{i+1}$ in $G_j^*$, where $j = \min(tf(v_{i-1}), tf(v_{i+1}))$. We repeat the above process with $S = S \setminus \{v_i\}$ until we have $tf(v) \ge \varphi$ for all remaining vertices $v$ in $S$, and let $S' = \langle s = v_1, \ldots, v_{\beta'} = t \rangle$ be the new sequence obtained at the end of this process.

We continue with $S'$ as follows. Consider the vertex $v_i \in S'$ that is not in $G_\varphi^*$ and $tf(v_i) \ge tf(v)$ for all $v \in S' \setminus \{v_i\}$. If $v_i$ exists in $S'$, then we have $v_{i-1} \to v_i$ in $G^*_{tf(v_{i-1})}$ and $v_i \to v_{i+1}$ in $G^*_{tf(v_{i+1})}$. Since $v_i$ is not in $G_\varphi^*$ and $tf(v_i) > \varphi$, $v_i$ is a dummy vertex and $v_i$ preserves the reachability from $v_{i-1}$ to $v_{i+1}$ in $G_j^*$, where $j = \min(tf(v_{i-1}), tf(v_{i+1}))$.
Thus, removing $v_i$ from $S'$ we still have $v_{i-1} \to v_{i+1}$ in $G_j^*$. We repeat the above process with $S' = S' \setminus \{v_i\}$ until all the remaining vertices are in $G_\varphi^*$. Let $S'' = \langle s = v_1, \ldots, v_{\beta''} = t \rangle$ be the new sequence obtained at the end of this process. Note that both $s$ and $t$ are still in $S''$ since $s$ and $t$ are in $G_\varphi^*$.

According to the derivation process, we have $v_i \to v_{i+1}$ in $G_\varphi^*$ for $1 \le i < \beta''$, from which we have $s = v_1 \to v_{\beta''} = t$. Thus, $s \to t$ in $G_\varphi^*$.

Lemma 7 reveals an important reachability relation between vertices in a sequence as defined in Lemma 6. This reachability relation is also crucial in the proofs of Lemmas 8 and 9.

LEMMA 8. Given two vertices $s, t \in V_G$, if there exists a vertex $x \in label_{out}(s) \cap label_{in}(t)$, then $s \to t$ in $G$.

PROOF. Let us first assume that $x \neq s$ and $x \neq t$. Then, according to Definition 5, if $x \in label_{out}(s)$, there exists a vertex $u \in label_{out}(s)$ such that $x \in nb_{out}(u, G^*_{tf(u)})$. Moreover, $u \in label_{out}(s)$ in turn implies that there exists $u' \in label_{out}(s)$ such that $u \in nb_{out}(u', G^*_{tf(u')})$. Thus, we obtain a sequence $S_{out} = \langle s = u_1, \ldots, u_\alpha = x \rangle$, where for $1 \le i < \alpha$ the edge $(u_i, u_{i+1})$ is in $G^*_{tf(u_i)}$. Similarly, we obtain another sequence $S_{in} = \langle x = v_\beta, \ldots, v_1 = t \rangle$, where for $1 \le i < \beta$ the edge $(v_{i+1}, v_i)$ is in $G^*_{tf(v_i)}$. According to Lemma 5, $tf(u_i) < tf(u_{i+1})$ for $1 \le i < \alpha$ and $tf(v_i) < tf(v_{i+1})$ for $1 \le i < \beta$. Thus, according to Lemma 7, the sequence $S = \langle s = u_1, \ldots, u_\alpha = x = v_\beta, \ldots, v_1 = t \rangle$ implies that $s \to t$ in $G_1^*$, and hence $s \to t$ in $G = G_1$ by Lemma 4.

If $x = t$, then $t \in label_{out}(s)$ gives the sequence $S = \langle s = u_1, \ldots, u_\alpha = x = t \rangle$, which implies that $s \to t$ in $G$. The case $x = s$ is similar.

The following lemma proves the reverse statement of Lemma 8.

LEMMA 9. Given two vertices $s, t \in V_G$, if $s \to t$ in $G$, then there exists a vertex $x \in label_{out}(s) \cap label_{in}(t)$.

PROOF. We show that if $s \to t$ in $G$, then there exists a sequence of vertices $S = \langle s, \ldots, t \rangle$ such that there is a vertex $x$ in $S$, where $x \in label_{out}(s)$ and $x \in label_{in}(t)$.

First, $s \to t$ in $G$ implies that there is a path $P = \langle s, \ldots, t \rangle$ in $G_1^*$ (by Procedure 1 and Lemma 4). According to Lemma 6, there exists a sequence $S = \langle s = w_1, \ldots, w_\gamma = t \rangle$ such that for $1 \le i < \gamma$, the edge $(w_i, w_{i+1})$ is in $G_j^*$ where $j = \min(tf(w_i), tf(w_{i+1}))$, and $S$ is maximal.

Next, we show that there exists a unique vertex $x$ in $S$ such that $tf(x) > tf(w)$ for all $w \in S \setminus \{x\}$. It is trivially true that there exists $x$ such that $tf(x) \ge tf(w)$ for all $w \in S \setminus \{x\}$. To remove the "=" sign, suppose to the contrary that there exists another vertex $x'$ such that $tf(x') = tf(x) = j$, which implies that $x$ and $x'$ are both in $G_j^*$. Assume, without loss of generality, that $x$ appears before $x'$ in $S$. Then, $tf(x') = tf(x) = j$ implies that $x$ and $x'$ are both in an independent set of $G_j^*$ according to Lemma 4. The independence between $x$ and $x'$ implies that either (1) $x \nrightarrow x'$ or (2) $x$ reaches $x'$ via some other vertex $x''$ in $G_j^*$ such that $tf(x'') > tf(x)$. For (1), it is a contradiction since $x \to x'$ in $G_j^*$ according to Lemma 7. For (2), we have the path $P' = \langle x, \ldots, x'', \ldots, x' \rangle$ in $G_j^*$ and by Lemma 6 we can obtain another sequence $S' = \langle x, \ldots, x'', \ldots, x' \rangle$ from $P'$, which contradicts the fact that $S$ is maximal.

We complete the proof by showing that the unique vertex $x$, where $tf(x) > tf(w)$ for all $w \in S \setminus \{x\}$, is in both $label_{out}(s)$ and $label_{in}(t)$. Let $S = \langle s = u_1, \ldots, u_\alpha = x = v_\beta, \ldots, v_1 = t \rangle$. We first consider the sub-sequence $\langle s = u_1, \ldots, u_\alpha = x \rangle$. If $s = u_1 = u_\alpha = x$, then $x \in label_{out}(s)$ by Definition 5. If $\alpha > 1$, for each $u_i$, we find the first $u_j$, where $i < j \le \alpha$, such that $tf(u_i) < tf(u_j)$. Such a $u_j$ must exist since there is at least one vertex $u_\alpha$ where $tf(u_i) < tf(u_\alpha)$. Moreover, $u_i \to u_j$ in $G^*_{tf(u_i)}$ according to Lemma 7. Thus, $(u_i, u_j)$ is an edge in $G^*_{tf(u_i)}$ because otherwise $u_i$ reaches $u_j$ in $G^*_{tf(u_i)}$ via some other vertex $u_k$, which contradicts the fact that $S$ is maximal. Thus, we obtain a sequence $\langle s = u_1', \ldots, u'_{\alpha'} = x \rangle$, where $tf(u_i') < tf(u'_{i+1})$ and $(u_i', u'_{i+1})$ is an edge in $G^*_{tf(u_i')}$ for $1 \le i < \alpha'$. According to Definition 5, $s = u_1' \in label_{out}(s)$; $u_2' \in label_{out}(s)$ since $u_1' \in label_{out}(s)$ and $u_2' \in nb_{out}(u_1', G^*_{tf(u_1')})$; ...; $u'_{i+1} \in label_{out}(s)$ since $u_i' \in label_{out}(s)$ and $u'_{i+1} \in nb_{out}(u_i', G^*_{tf(u_i')})$; ...; $x = u'_{\alpha'} \in label_{out}(s)$ since $u'_{\alpha'-1} \in label_{out}(s)$ and $u'_{\alpha'} \in nb_{out}(u'_{\alpha'-1}, G^*_{tf(u'_{\alpha'-1})})$.
Finally, a similar analysis shows that $x \in label_{in}(t)$.

We note that the sequence $S$ in the proof of Lemma 9 may not be unique, but we only need to show the existence of one such sequence for the proof.

The following theorem proves the correctness of reachability query answering by vertex labels.

THEOREM 1. Given a reachability query whether a vertex $s \in V_G$ can reach another vertex $t \in V_G$, the answer given by Equation (1) is correct.

PROOF. The proof follows directly from Lemmas 8 and 9.

5. REMOVING DUMMY VERTICES

The vertex labels constructed in Section 4 contain dummy vertices, which may take up a lot of space and incur extra processing in query answering. In this section, we propose a new label with all dummy vertices removed.

According to Procedure 1, a dummy vertex $w$ is created only as either an out-neighbor of $u$ or an in-neighbor of $v$ for a cross-level edge $(u, v)$. If $w$ is created as an out-neighbor of $u$ (or an in-neighbor of $v$), then $u$ (or $v$) is called the in-source vertex (or out-source vertex) of $w$, denoted by $src(w) = u$ (or $src(w) = v$). If $src(w) = v$ is a vertex in $G$, i.e., $v$ is not a dummy vertex, then $v$ is called the root vertex of $w$, denoted by $rt(w)$. In general, we have $rt(w) = src(src(\cdots src(w) \cdots))$.

With the definition of in-source/out-source vertices and root vertices, we define a new vertex label as follows.

DEFINITION 6 (VERTEX LABEL WITHOUT DUMMIES). Let $f(u)$ be a function such that $f(u) = rt(u)$ if $u$ is a dummy vertex, and $f(u) = u$ otherwise. The new labels of a vertex $v \in V_G$, denoted by $label2_{in}(v)$ and $label2_{out}(v)$, are defined as follows:
$label2_{in}(v) = \{f(u) : u \in label_{in}(v)\}$.
$label2_{out}(v) = \{f(u) : u \in label_{out}(v)\}$.

Intuitively, $label2_{in}(v)$ is obtained by replacing every dummy vertex $u$ in $label_{in}(v)$ with $rt(u)$, and similarly for $label2_{out}(v)$. For all $v \in V_G$, $|label2_{in}(v)| \le |label_{in}(v)|$ and $|label2_{out}(v)| \le |label_{out}(v)|$, since there can be multiple dummy vertices with the same root vertex and/or the root vertex may already exist in the set. Thus, compared with label, label2 reduces index storage space and improves querying efficiency.
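Definition 6 and the intersection test of Equation (1) can be tried out on the labels of Table 1. The sketch below is illustrative rather than the authors' code: the dictionary layout and the names `rt`, `strip_dummies`, and `reaches` are assumptions, only the rows of original (non-dummy) vertices are copied from Table 1, and the root map follows the dummy vertices named in Example 2.

```python
# Labels of the original vertices, copied from Table 1 (dummies written x_k).
label_out = {
    "a": {"a", "a_1", "c", "d", "e_1", "f"}, "b": {"b", "b_1", "d", "f"},
    "e": {"e", "e_1", "f"}, "g": {"g", "h"}, "c": {"c", "e_1", "f"},
    "d": {"d", "f"}, "h": {"h"}, "f": {"f"},
}
label_in = {
    "a": {"a"}, "b": {"b"}, "e": {"c", "e"}, "g": {"e_1", "f", "g"},
    "c": {"c"}, "d": {"d"}, "h": {"a_2", "e_1", "f", "h"}, "f": {"f"},
}
# Root vertex rt(w) of each dummy vertex w.
rt = {"a_1": "a", "a_2": "a", "b_1": "b", "e_1": "e"}

def strip_dummies(label):
    """Definition 6: replace each dummy vertex by its root vertex."""
    return {rt.get(u, u) for u in label}

label2_out = {v: strip_dummies(s) for v, s in label_out.items()}
label2_in = {v: strip_dummies(s) for v, s in label_in.items()}

def reaches(s, t):
    """Equation (1) with label replaced by label2: non-empty intersection."""
    return not label2_out[s].isdisjoint(label2_in[t])
```

Because the labels are plain sets, several dummies with the same root collapse into one entry, which is where the size reduction described above comes from. In an index built for speed, sorted arrays with a merge-style intersection would be a natural alternative to hash sets; that choice is not specified in this section.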
The following lemma and theorem prove the correctness of query answering using label2.

LEMMA 10. Given $s, t \in V_G$, (1) if $x \in label_{out}(s)$ and $rt(x) \notin label_{out}(s)$, then $s \to rt(x)$ in $G$; and (2) if $x \in label_{in}(t)$ and $rt(x) \notin label_{in}(t)$, then $rt(x) \to t$ in $G$.

PROOF. We first prove (1). From the proof of Lemma 8, $x \in label_{out}(s)$ implies a sequence $S = \langle s = u_1, \ldots, u_\alpha = x \rangle$, where for $1 \le i < \alpha$ the edge $(u_i, u_{i+1})$ is in $G^*_{tf(u_i)}$. Since $x$ is a dummy vertex, according to Procedure 1 there exists another sequence $S_2 = \langle rt(x) = v_1, \ldots, v_{\beta-1} = src(x), v_\beta = x \rangle$, where for $1 \le i < \beta$: either the edge $(v_i, v_{i+1})$ is in $G^*_{tf(v_i)}$ if $rt(x)$ is an in-source vertex, or $(v_{i+1}, v_i)$ is in $G^*_{tf(v_i)}$ if $rt(x)$ is an out-source vertex.

If $rt(x)$ is an in-source vertex, then we construct the proof as follows. Let $y = x$. Starting from $i = \alpha - 1$ down to $i = 2$, we reassign $y = u_i$ if $u_i = src(y)$ (note that $i \neq 1$ since $s = u_1 = rt(x)$ contradicts $rt(x) \notin label_{out}(s)$). Let $\langle s = u_1, \ldots, u_{\alpha'} = y \rangle$ be the sub-sequence such that $u_{\alpha'-1} \neq src(y)$. According to Procedure 1, $u_{\alpha'-1}$ is an in-neighbor of $rt(x)$, so that $u_{\alpha'-1}$ is also connected to $v_2$ in $G^*_{tf(rt(x))}$ to preserve the reachability from $u_{\alpha'-1}$ to the cross-level out-neighbors of $rt(x)$ (now via $v_2$). Note that $v_2$ may not be in $label_{out}(s)$, i.e., $S$, because $v_2$ may not be an out-neighbor of $u_{\alpha'-1}$ in $G^*_{tf(u_{\alpha'-1})}$, i.e., $tf(v_2) < tf(u_{\alpha'-1})$. Thus, we have the sequence $\langle s = u_1, \ldots, u_{\alpha'-1}, rt(x) \rangle$, where $(u_{\alpha'-1}, rt(x))$ is in $G^*_{tf(rt(x))}$, from which we have $s \to rt(x)$ in $G$ by Lemma 7.

If $rt(x)$ is an out-source vertex, then we have $\langle s = u_1, \ldots, u_\alpha = x = v_\beta, v_{\beta-1} = src(x), \ldots, v_1 = rt(x) \rangle$. Again, by Lemma 7 we have $s \to rt(x)$ in $G$.

Similarly we can prove (2).

THEOREM 2. Given a reachability query whether a vertex $s \in V_G$ can reach another vertex $t \in V_G$, the answer given by Equation (1) with label replaced by label2 is correct.

PROOF. Let $X = label_{out}(s) \cap label_{in}(t)$ and $X2 = label2_{out}(s) \cap label2_{in}(t)$. We show that (1) if $X \neq \emptyset$, then $X2 \neq \emptyset$, and (2) if $X = \emptyset$, then $X2 = \emptyset$.

We first prove (1). If $X \neq \emptyset$, then either (i) $\exists x \in X$ such that $x$ is not a dummy vertex, or (ii) $\forall x \in X$, $x$ is a dummy vertex. For (i), $x$ is also in $X2$ according to Definition 6 and hence $X2 \neq \emptyset$.
For(), rt(x) s n X2 and hence X2. We now prove (2). Suppose to the contrary that X2, whch must be caused by the replacement of some dummy vertex x by rt(x),.e., rt(x) X2 for some dummy vertex x. Wehavethe followng possble cases:

(i) If x ∈ label_out(s) and rt(x) ∉ label_out(s): then we have rt(x) ∈ label2_out(s) as a replacement of x. Thus, by Lemma 10, we have s → rt(x) in G. Otherwise, rt(x) is originally in label_out(s) since rt(x) ∈ X2. Thus, we have rt(x) = s, or s → rt(x) in G by Lemma 8 since rt(x) ∈ label_out(s) and rt(x) ∈ label_in(rt(x)).

(ii) If x ∈ label_in(t) and rt(x) ∉ label_in(t): then similarly as (i) we have rt(x) → t in G by Lemma 10. Otherwise, similarly as (i) we have either rt(x) = t, or rt(x) → t in G.

For every combination of the cases in (i) and (ii) above, we have s → t in G, which implies X ≠ ∅ by Lemma 9 and thus a contradiction. Therefore, we have our result that X = ∅ implies X2 = ∅.

Given (1) and (2), the correctness of the theorem follows directly from Theorem 1.

The following example illustrates the concept of label2.

EXAMPLE 5. Table 2 shows the labeling of the same graph in Example 3 with dummy vertices removed. In Table 1, we have label_out(b) = {b, b_1, d, f}, but label2_out(b) = {b, d, f} in Table 2 since rt(b_1) = b already exists in label_out(b). For label_out(c) = {c, e_1, f} in Table 1, we replace dummy vertex e_1 with rt(e_1) = e and obtain label2_out(c) = {c, e, f} in Table 2. Similarly, we obtain label2 for all other vertices in G.

Table 2: Removing dummy vertices from the labels in Table 1

vertex   label2_out     label2_in
a        {a,c,d,e,f}    {a}
b        {b,d,f}        {b}
e        {e,f}          {c,e}
g        {g,h}          {e,f,g}
c        {c,e,f}        {c}
d        {d,f}          {d}
h        {h}            {a,e,f,h}
f        {f}            {f}

6. HANDLING HIGH-DEGREE VERTICES

In the construction of G_{i+1} from G_i, or 𝒢 from G, many new edges may be created to connect the in-neighbors of a vertex v to v's out-neighbors. Although such connections are necessary to preserve reachability after v is removed, the construction is costly in the presence of high-degree vertices since the number of edges created is given by (deg_in(v, G_i) × deg_out(v, G_i)). The following example illustrates the problem caused by high-degree vertices.

EXAMPLE 6.
Consider the example in Figure 3(a): f is a high-degree vertex with deg_in(f, G_1) × deg_out(f, G_1) = 3 × 5 = 15. By Procedure 1, f is removed at the first iteration and we need to add many edges in order to maintain reachability in G_2, as shown in Figure 3(b).

[Figure 3: Problem caused by high-degree vertices. Panels: (a) G = G_1, (b) G_2, (c) G_3.]

In the DAG of many real graphs, we often have a few vertices with very high degree (these vertices normally correspond to giant SCCs in the original directed graph). For example, in the p2p dataset, we have a vertex v with deg_in(v, G_1) = […] and deg_out(v, G_1) = 366. Such high-degree vertices take up a lot of space in the intermediate graphs and hence incur a significant amount of extra processing in the overall labeling process.

Here we propose a method to address this problem. For simplicity, in the subsequent discussion we focus on handling high-degree vertices in G_1 = G, but we remark that the method applies to other G_i in the same way.

Given a vertex v ∈ V_G, we define the set of vertices that are reachable from v as reach_out(v, G) = {u : v → u}, and the set of vertices that can reach v as reach_in(v, G) = {u : u → v}. Let H be the set of top-k high-degree vertices defined as follows: ∀h ∈ H and ∀v ∈ V_G \ H, (deg_in(h, G) × deg_out(h, G)) ≥ (deg_in(v, G) × deg_out(v, G)). We may set k as the h-index value of a graph [8, 9].

We propose a new vertex label of a vertex v ∈ V_G, denoted by label3_in(v) and label3_out(v), which has dummy vertices removed as in Section 5 and high-degree vertices handled as follows:

1. For each h ∈ H, label3_in(h) = {h} and label3_out(h) = {h}.
2. For each v ∈ V_G \ H, initialize label3_in(v) = {h : h ∈ H, v ∈ reach_out(h, G)} and label3_out(v) = {h : h ∈ H, v ∈ reach_in(h, G)}.
3. Remove all vertices in H, together with all edges incident to them, from G. Let G′ be the remaining graph.
4. For each v ∈ V_G′ (i.e., v ∈ V_G \ H), construct label2_in(v) and label2_out(v) from G′ as discussed in Sections 4 and 5.
5. For each v ∈ V_G \ H, label3_in(v) = label2_in(v) ∪ label3_in(v) and label3_out(v) = label2_out(v) ∪ label3_out(v).
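The five steps above can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: `closure_labels` is a deliberately simple stand-in for the TF-based label2 of Sections 4 and 5 (it stores all descendants in the out-label), the function names are hypothetical, and the toy graph with a single hub f is made up for the demonstration (it is not the graph of Figure 3).

```python
from collections import defaultdict


def reachable(adj, start):
    """All vertices reachable from start (start excluded), iterative DFS."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    seen.discard(start)
    return seen


def closure_labels(vertices, edges):
    """Stand-in for label2: out-label = v plus descendants, in-label = {v}."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    label_in = {v: {v} for v in vertices}
    label_out = {v: {v} | reachable(adj, v) for v in vertices}
    return label_in, label_out


def build_label3(vertices, edges, k, label2_builder=closure_labels):
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)
    # H: top-k vertices ranked by deg_in * deg_out.
    ranked = sorted(vertices, reverse=True,
                    key=lambda v: len(in_adj[v]) * len(out_adj[v]))
    H = set(ranked[:k])
    l3_in = {v: set() for v in vertices}
    l3_out = {v: set() for v in vertices}
    for h in H:                                   # Step 1
        l3_in[h], l3_out[h] = {h}, {h}
    for h in H:                                   # Step 2
        for v in reachable(out_adj, h) - H:
            l3_in[v].add(h)
        for v in reachable(in_adj, h) - H:
            l3_out[v].add(h)
    rest = [v for v in vertices if v not in H]    # Step 3: G' = G without H
    rest_edges = [(u, v) for u, v in edges if u not in H and v not in H]
    l2_in, l2_out = label2_builder(rest, rest_edges)   # Step 4
    for v in rest:                                # Step 5
        l3_in[v] |= l2_in[v]
        l3_out[v] |= l2_out[v]
    return l3_in, l3_out


def can_reach(s, t, l3_in, l3_out):
    """Equation 1 style test: non-empty intersection of the two labels."""
    return not l3_out[s].isdisjoint(l3_in[t])


# Toy DAG: f is the only vertex with deg_in * deg_out = 2 * 2 = 4.
V = ["a", "b", "f", "x", "y", "z"]
E = [("a", "f"), ("b", "f"), ("f", "x"), ("f", "y"), ("b", "z")]
l3_in, l3_out = build_label3(V, E, k=1)
print(can_reach("a", "x", l3_in, l3_out))  # True, via the hub f
print(can_reach("a", "z", l3_in, l3_out))  # False
```

Passing the label2 construction in as a parameter keeps the hub handling separate from whichever labeling is run on the remaining graph G′.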
The following theorem proves the correctness of reachability query answering using label3 obtained from the above steps.

THEOREM 3. Given a reachability query whether a vertex s ∈ V_G can reach another vertex t ∈ V_G, the answer given by Equation 1 with label replaced by label3 is correct.

PROOF. First, we show that if s → t in G, i.e., there exists a path P = ⟨s, ..., t⟩ in G, then the answer returned is true.

1. If P contains no vertex in H, then P must be in the remaining graph G′. Thus, query answering using label2, which is constructed from G′ and contained in label3, returns true as proved in Theorem 2.
2. If P contains at least one vertex h ∈ H, then we must have h ∈ label3_out(s) and h ∈ label3_in(t). Thus, the answer returned is true.

Next, we show that if s ↛ t in G, then the answer returned is false. Suppose to the contrary that the answer is true, i.e., ∃x ∈ (label3_out(s) ∩ label3_in(t)).

1. If x ∈ H, then we have s ∈ reach_in(x, G) and t ∈ reach_out(x, G), assuming that x ≠ s and x ≠ t. Thus, we have s → x and x → t in G, which implies s → t in G. Now if x = s or x = t, then t ∈ reach_out(x = s, G) or s ∈ reach_in(x = t, G), which again implies s → t in G. In each case, the result contradicts the fact that s ↛ t in G.
2. If x ∉ H, then x ∈ label2_out(s) and x ∈ label2_in(t), which implies s → t in G′ by Theorem 2. Since G′ is a subgraph of G, we have s → t in G, which is a contradiction.

The following example further illustrates the idea.

EXAMPLE 7. Consider the example in Figure 3. We first obtain reach_in(f, G) = {a, b, c, d, e} and reach_out(f, G) = {h, i, j, k, l, m, n}. Then, we initialize label3 for the vertices: label3_out(v) = {f} for each v ∈ {a, b, c, d, e}, and label3_in(v) = {f} for each v ∈ {h, i, j, k, l, m, n}. Then, we remove f and all edges incident to f, which gives the graph shown in Figure 4(a). Next we construct the TF and then label2 from the DAG in Figure 4(a). Finally, we merge label2 and label3 to obtain the final label3 as shown in Table 3(a). Compared with label2 computed for the graph in Figure 3(a), which is shown in Table 3(b), label3 is considerably smaller. The example also reveals that after removing the high-degree vertices, the graph becomes much easier to handle.

[Figure 4: Topological folding with high-degree vertex removed. Panels: (a) G = G_1, (b) G_2, (c) G_3.]

Table 3: Labeling for G in Figure 3(a): (a) label3; (b) label2

(a)
vertex   label3_out     label3_in
a        {a,c,f}        {a}
b        {b,d,e,f,k}    {b}
c        {c,f}          {c}
d        {d,f}          {d}
e        {e,f,k}        {e}
f        {f}            {f}
g        {g,k}          {e,g}
h        {h}            {f,h}
i        {i,l}          {f,i}
j        {j,m}          {f,j}
k        {k}            {f,k}
l        {l}            {f,l}
m        {m}            {f,m}
n        {n}            {f,l,m,n}

(b)
vertex   label2_out           label2_in
a        {a,c,f,h,i,j,k}      {a}
b        {b,d,e,f,h,i,j,k}    {b}
c        {c,f,h,i,j,k}        {c}
d        {d,f,h,i,j,k}        {d}
e        {e,f,h,i,j,k}        {e}
f        {f,h,i,j,k}          {c,d,e,f}
g        {g,k}                {e,g}
h        {h}                  {h}
i        {i}                  {i}
j        {j}                  {j}
k        {k}                  {k}
l        {l,n}                {i,l}
m        {m,n}                {j,m}
n        {n}                  {f,i,j,n}

7. ALGORITHM AND COMPLEXITY

In this section, we discuss the algorithmic and complexity issues of our proposed method.
Our method consists of two main phases, namely, the pre-processing or indexing phase and the query processing phase. Query processing is just an intersection of two sets which terminates as soon as the first common element is found, and thus the complexity is bounded by the label size. The pre-processing phase includes computing the DAG from an input directed graph, topological sorting of the resulting DAG, construction of the transformed TF structure, and the label construction. The steps before labeling are either simple or have been presented in sufficient detail. We therefore focus our discussion on the labeling algorithm here.

Algorithm 1: Labeling(𝒢 = {G_1, ..., G_{tf(G)}})
 1  Let V_{G_i} = ∅, where i = tf(G) + 1;
 2  for i = 1, ..., tf(G) do
 3      foreach v ∈ (V_{G_i} \ V_{G_{i+1}}) do
 4          label_in(v) ← {v} ∪ {u : (u, v) ∈ G_i};
 5          label_out(v) ← {v} ∪ {u : (v, u) ∈ G_i};
 6  for i = tf(G), ..., 1 do
 7      foreach v ∈ (V_{G_i} \ V_{G_{i+1}}) do
 8          foreach u ∈ label_in(v) do
 9              label_in(v) ← label_in(v) ∪ label_in(u);
10          foreach u ∈ label_out(v) do
11              label_out(v) ← label_out(v) ∪ label_out(u);
12  return label_in(v) and label_out(v) for all vertices v;

We propose an efficient top-down algorithm to construct the vertex labels defined in Definition 5. As shown in Algorithm 1, Lines 1-5 initialize label_in(v) and label_out(v) for each vertex v to contain the in-neighbors and out-neighbors of v in G_{tf(v)}. Note that for each v ∈ (V_{G_i} \ V_{G_{i+1}}), tf(v) = i since v no longer exists in G_{i+1}. Line 1 is introduced so that (V_{G_i} \ V_{G_{i+1}}) = V_{G_i} when i = tf(G) in Lines 3 and 7, since G_{tf(G)+1} does not really exist. Lines 6-11 perform a top-down operation starting at the highest level of the TF structure. At each level i, for each vertex v ∈ (V_{G_i} \ V_{G_{i+1}}), we simply include the in-label (out-label) of v's in-neighbors (out-neighbors) in label_in(v) (label_out(v)).

The correctness of Algorithm 1 follows from Definition 5 and Lemma 5. While the algorithm does not remove dummy vertices, we discuss how this can be handled with little additional overhead, as inspired by the following lemma.

LEMMA 11.
For any vertex v ∈ V_G and any G_i ∈ 𝒢, at most two dummy vertices will be created in G_i whose root vertex is v.

PROOF. According to Procedure 1, initially we may create one dummy vertex u_out as an out-neighbor of v and/or another dummy vertex u_in as an in-neighbor of v, and u_out and u_in must be created in G_{tf(v)}. At most one dummy vertex (let it be w_out) will be created as an out-neighbor of u_out, since all incoming edges of u_out are not cross-level edges by construction, and w_out must be created in G_j, where j = tf(u_out). Similarly, at most one dummy vertex will be created as an out-neighbor of w_out, and so on. A similar analysis applies to u_in, and thus in any G_i ∈ 𝒢 we have at most two dummy vertices created whose root vertex is v.

If v is the root vertex of any dummy vertex and v is the in-source vertex, then Lemma 11 implies the existence of a unique sequence S_out = ⟨v = u_1, ..., u_α⟩, where u_{j−1} is the in-source vertex of u_j for 1 < j ≤ α; thus, we can use only two labels, label_in(u_j) and label_out(u_j), to keep the labels for all dummy vertices u_j at each level i = tf(u_j) in Lines 6-11 of Algorithm 1. Similarly, the same strategy applies to another unique sequence if v is the root vertex of a set of dummy vertices and v is the out-source vertex.
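As a concrete reading of Lines 1-11 of Algorithm 1 (without the dummy-vertex label sharing just described), the following sketch builds the labels from a given TF structure and answers a query by label intersection. The input format, the function names, and the two-level folding of the path a → b → c are assumptions made for illustration; in particular, the toy folding is hand-made and only assumed to be a valid output of Procedure 1.

```python
def tf_labeling(levels):
    """levels[i] = (vertex set, edge list) of G_{i+1}, for i = 0..tf(G)-1."""
    n = len(levels)
    # Line 1: append an empty vertex set that plays the role of G_{tf(G)+1}.
    vertex_sets = [set(vs) for vs, _ in levels] + [set()]
    label_in, label_out, removed_at = {}, {}, []

    # Lines 2-5: a vertex is initialized at the level where it disappears,
    # with itself plus its in-/out-neighbors in that level's graph.
    for i in range(n):
        gone = vertex_sets[i] - vertex_sets[i + 1]
        removed_at.append(gone)
        for v in gone:
            label_in[v] = {v}
            label_out[v] = {v}
        for u, v in levels[i][1]:
            if v in gone:
                label_in[v].add(u)
            if u in gone:
                label_out[u].add(v)

    # Lines 6-11: top-down pass, highest level first; iterate over a snapshot
    # because the label set grows while it is being expanded.
    for i in range(n - 1, -1, -1):
        for v in removed_at[i]:
            for u in list(label_in[v]):
                label_in[v] |= label_in[u]
            for u in list(label_out[v]):
                label_out[v] |= label_out[u]
    return label_in, label_out


def reaches(s, t, label_in, label_out):
    """s reaches t iff the out-label of s and the in-label of t intersect."""
    return not label_out[s].isdisjoint(label_in[t])


# Toy TF structure for the path a -> b -> c: a and c vanish after level 1.
levels = [({"a", "b", "c"}, [("a", "b"), ("b", "c")]), ({"b"}, [])]
label_in, label_out = tf_labeling(levels)
print(reaches("a", "c", label_in, label_out))  # True
print(reaches("c", "a", label_in, label_out))  # False
```

In a production index the labels would more likely be stored as sorted arrays so that the intersection can stop at the first common element; Python sets are used here only to keep the sketch short.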

Thus, in the top-down labeling process, in total we maintain at most four labels for each vertex v ∈ V_G for all dummy vertices created with v as their root vertex.

Next we analyze the complexity of the pre-processing phase. Computing the DAG takes linear time in the size of the input directed graph. Given the DAG G = (V_G, E_G), topological sorting takes O(|V_G| + |E_G|) time. Then, we apply Procedure 1 to construct the TF structure, which takes O(lg l(G)) iterations of Steps 2 and 3. At the i-th iteration, we need O(Σ_{v ∈ V_{G_i}} (deg_in(v, G_i) × deg_out(v, G_i))) time for the construction. From Lemma 11, |V_{G_i}| ≤ 2|V_G| and the degree of a dummy vertex w is bounded by that of src(w). The total time complexity is given by

$$C_1 = O\Big(\sum_{1 \le i \le \lg l(G)} \; \sum_{v \in V_{G_i}} \big(deg_{in}(v, G_i) \times deg_{out}(v, G_i)\big)\Big).$$

The complexity of Algorithm 1, together with dummy vertex handling, is bounded by

$$C_2 = O\Big(\sum_{1 \le i \le \lg l(G)} \; \sum_{v \in (V_{G_i} \setminus V_{G_{i+1}})} \Big(\sum_{u \in nb_{in}(v, G_i)} |label_{in}(u)| + \sum_{u \in nb_{out}(v, G_i)} |label_{out}(u)|\Big)\Big).$$

Both C_1 and C_2 depend on the characteristics of the input DAG, especially the vertex degree. Both can be significantly reduced by removing the set of high-degree vertices H, which takes O(|H|(|V_G| + |E_G|)) time to remove H and add h ∈ H to the labels of other vertices as discussed in Section 6.

8. EXPERIMENTAL EVALUATION

We implemented our method, TF-label, in C++ (source code available on the authors' webpage). We compare TF-label with the following state-of-the-art methods for processing reachability queries: PathTree [19], […] [28], […] [25], ScaPathTree and […]. ScaPathTree and […] are the application of PathTree and […] in the SCARAB framework [18], i.e., first computing the backbone of the input DAG and then applying PathTree or […] for reachability querying (more details in Section 1). Though in theory any existing method can be applied in SCARAB, we were not able to do so for […] and TF-label due to unfamiliarity with their system. ScaPathTree and […] were provided by the authors of [18].
All source codes of the methods we compare with are the latest versions provided by their authors, and all were implemented in C++ and compiled using the same gcc compiler as TF-label. We ran all experiments on a computer with an Intel 3.3 GHz CPU and 16GB RAM, running Ubuntu Linux OS.

8.1 Performance on Real Datasets

We first evaluate the performance of our method on real-world datasets from a wide spectrum of domains. As shown below, the first set of 7 datasets are from 3 different domains, while the second set of 5 datasets are from 5 different domains. We want to examine the differences in the spectrum of datasets that our method can handle versus those of existing methods.

Real datasets. We used the following 7 large real datasets that are used in [18, 28] for scalability tests: citeseer, citeseerx and cit-patent (patent) are citation networks, in which non-leaf vertices have an average out-degree of 10 to 30; go-uniprot is the joint graph of Gene Ontology terms and the annotations from the UniProt database, which is the universal protein resource; uniprot22m, uniprot100m and uniprot150m are subsets of the complete RDF graph of UniProt.

We also used 5 real datasets from the Stanford Large Network Dataset Collection. We selected one large directed graph from each of the following categories: email-euall (email) from communication networks, soc-livejournal1 (LJ) from social networks, p2p-gnutella31 (p2p) from Internet peer-to-peer networks, web-google (web) from Web graphs, and wiki-talk (wiki) from Wikipedia networks. In addition, cit-patent from citation networks is already included in the first 7 graphs. Detailed descriptions of the datasets can be found in (snap.stanford.edu/data).

Table 4: Real datasets (K = 10^3)

Dataset       original |V|   original |E|   DAG |V_G|   DAG |E_G|   l(G)   d_avg
citeseer      -              -              694K        312K
citeseerx     -              -              6540K       15011K
go-uniprot    -              -              6968K       34770K
patent        -              -              3775K       16519K
uniprot22m    -              -              1595K       1595K
uniprot100m   -              -              16087K      16087K
uniprot150m   -              -              25038K      25038K
email         265K           420K           231K        223K
LJ            4848K          68994K         971K        1024K
p2p           63K            148K           48K         55K
web           876K           5105K          372K        518K
wiki          2394K          5021K          2282K       2312K
Table 4 lists the number of vertices and edges in the original directed graph as well as in its DAG G, respectively. We do not show the vertex and edge counts of the original graph for the datasets obtained from [28] since the authors did not provide these numbers. Note that existing methods for reachability querying assume that the input is a DAG. We also show the topological level number of G, l(G), as well as the average degree of the vertices (denoted by d_avg) in G.

Indexing Performance. We first report indexing performance results, but remark that (online) query performance should be the more important performance indicator, provided that (offline) indexing performance is reasonable. We report the index construction time (total elapsed time in seconds) in Table 5. The shortest time for each dataset is highlighted in bold.

[Table 5: Index construction time (in sec). Columns include TF-label, PathTree and ScaPathTree; rows: citeseer, citeseerx, go-uniprot, patent, uniprot22m, uniprot100m, uniprot150m, email, LJ, p2p, web, wiki.]

For the datasets from [28], […] has the best performance and the performance of […] is close to that of […]. The indexing time of TF-label is comparable to that of […] for most datasets. For citeseerx and patent, TF-label is 135 and 8.5 times faster than […]. Compared with ScaPathTree, our method is from a few times to 74 times faster. ScaPathTree was not able to obtain the results for citeseerx and patent, while PathTree can only run on citeseer.

For the datasets from the Stanford Collection, TF-label is the best for indexing all the datasets. TF-label is about twice as fast as […] and […] on average, and up to orders of magnitude faster than […], PathTree and ScaPathTree. We note that we did not specifically pick these datasets, but rather simply selected one large graph from each category of directed graphs (we did leave out two categories because the DAGs of these graphs are too small,

for which most existing methods will be efficient enough). Therefore, the result shows that our method is able to perform well for graphs from various domains.

Table 6 reports the index size (in MB). For the 3 uniprot datasets, TF-label is from about 3 to 10 times smaller than all other methods. For citeseer, TF-label is only worse than PathTree, but much better than the other methods. But for citeseerx, patent and go-uniprot, TF-label is much larger. However, for the second set of 5 datasets, TF-label is much smaller in all cases except p2p, for which it is larger than PathTree.

[Table 6: Index or label size (in MB). Columns include TF-label, PathTree and ScaPathTree; rows: citeseer, citeseerx, go-uniprot, patent, uniprot22m, uniprot100m, uniprot150m, email 0.9, LJ, p2p, web, wiki.]

Overall, the results of indexing time and index size show that our method is very competitive in indexing performance, especially for the datasets from the Stanford Collection. In fact, only […] and […] are able to beat TF-label for indexing a few datasets. However, next we will show that […] and […] are significantly slower in query processing than TF-label for all datasets.

Query Performance. We randomly generate 1 million queries for each dataset, and Table 7 reports the total time taken to run the queries (the shortest time for each dataset is highlighted in bold).

[Table 7: Total query processing time (in milli-sec). Columns include TF-label, PathTree and ScaPathTree; rows: citeseer, citeseerx, go-uniprot, patent, uniprot22m, uniprot100m, uniprot150m, email, LJ, p2p, web, wiki.]

The result clearly shows that TF-label outperforms all other methods in all cases except for p2p, for which TF-label is comparable with […]. […] can run on all datasets, but is from about 2 to 32 times slower than TF-label. ScaPathTree and […] are also significantly slower than TF-label, and they cannot scale to run on a number of datasets. […] is up to orders of magnitude slower than TF-label, and PathTree cannot scale for processing most of the datasets.

Another important feature of TF-label is that it has stable good performance for all datasets, unlike the other methods, which are slow for processing some datasets.
For example, […] is particularly slow in processing web, for which ScaPathTree and […] perform reasonably well. Similarly, ScaPathTree is slow in processing uniprot150m and […] is slow in processing patent. Such a stable performance from TF-label is important for handling datasets from various application domains.

We also emphasize that TF-label can be further applied in the SCARAB framework, as do […] and ScaPathTree, to improve the performance. Thus, our result is impressive since TF-label even significantly outperforms the existing methods applied in SCARAB. In the next experiment, we show that TF-label scales well where all existing methods, including SCARAB, cannot scale, for both indexing and querying.

8.2 Scalability and Effects of Various Graph Properties

We use synthetic datasets to control the different properties of the DAG and hence assess their effects on the performance of our method, for both efficiency and scalability.

Synthetic datasets. We consider three important properties of the DAG: (1) the number of vertices (|V_G|), (2) the average vertex degree (d_avg), and (3) the number of topological levels (l(G)). We generate three categories of datasets as follows (let M = 10^6):

(C1) Fix d_avg = 3 and l(G) = 7, then set |V_G| = 5M, 10M, 20M, 40M and 80M, respectively.
(C2) Fix |V_G| = 1M and l(G) = 7, then set d_avg = 10, 20, 30, 40 and 50, respectively.
(C3) Fix |V_G| = 1M and d_avg = 3, then set l(G) = 3, 7, 15, 31 and 63, respectively.

For the generation of a DAG G with |V_G| vertices, l(G) levels, and average degree d_avg, we first create |V_G| vertices and distribute them to the l(G) levels. Then, for each vertex v at each level i, where 1 < i < l(G), we add one edge from a vertex selected randomly at level i−1 to v, and add edges from v to (d_avg − 1) randomly selected vertices at level j > i in G. To test query performance, we randomly generate 1 million queries for each dataset.

Effect of number of vertices. Figure 5 reports the performance results of processing the (C1) datasets, where we vary the number of vertices |V_G| from 5M to 80M (M = 10^6).
For index construction, TF-label is significantly faster than all other methods except […]. Compared with […], TF-label is slower when |V_G| ≤ 20M, but is 3 times faster when |V_G| ≥ 40M. When |V_G| = 80M, all other methods failed (we terminated them after they took two orders of magnitude longer than ours). […] could only handle 5M vertices, while PathTree failed even with 5M vertices (thus not shown in Figure 5). Moreover, ScaPathTree and […] also cannot scale well, since SCARAB failed to construct the backbone for such large datasets.

The index size of TF-label is about twice that of […], and is 1.5 to 3 times smaller than that of the other methods (for the datasets they can handle).

For query processing, TF-label is again significantly faster than all the other methods. Moreover, we also see that […] is the slowest and is over an order of magnitude slower than TF-label. When |V_G| = 40M, […] is 6400 times slower than TF-label.

Overall, TF-label is shown to be much more scalable than the existing methods with the increase in the number of vertices, i.e., also in the graph size. The results also show that the indexing performance of TF-label scales linearly with the increase in the graph size, while its query performance remains reasonably stable. Query time does not increase much as the graph grows because the average label size remains stable, which can be observed from the fact that the index size increases only linearly.

Effect of average vertex degree. Figure 6 reports the performance results of processing the (C2) datasets, where we vary the average vertex degree from 10 to 50.
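The synthetic DAG generation described under "Synthetic datasets" can be sketched as follows. The paper does not say how vertices are distributed over the levels or how random choices are made, so the round-robin placement, the seeded generator, and the function name `generate_dag` are assumptions of this sketch; it also does nothing extra to attach vertices of the last level that happen not to be selected.

```python
import random


def generate_dag(num_vertices, num_levels, d_avg, seed=0):
    """Return (levels, edges) for a layered random DAG.

    levels[i] lists the vertices placed at level i+1; every edge goes from a
    lower level to a strictly higher one, so the result is acyclic.
    """
    rng = random.Random(seed)
    # Assumption: vertices are spread over the levels round-robin.
    levels = [[] for _ in range(num_levels)]
    for v in range(num_vertices):
        levels[v % num_levels].append(v)

    edges = set()
    # Interior levels only, i.e. 1 < i < l(G) in the paper's 1-based numbering.
    for i in range(1, num_levels - 1):
        higher = [u for level in levels[i + 1:] for u in level]
        for v in levels[i]:
            # One incoming edge from a random vertex of the previous level.
            edges.add((rng.choice(levels[i - 1]), v))
            # (d_avg - 1) outgoing edges to random vertices at higher levels.
            for w in rng.sample(higher, min(d_avg - 1, len(higher))):
                edges.add((v, w))
    return levels, edges


# Small instance in the spirit of category (C1): d_avg = 3, l(G) = 7.
levels, edges = generate_dag(num_vertices=70, num_levels=7, d_avg=3, seed=1)
print(len(levels), sum(len(level) for level in levels))  # 7 70
```

Fixing the seed makes a generated dataset reproducible, which is convenient when the same graph is handed to several indexing methods.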
