Scalable Diversified Ranking on Large Graphs


IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XXX, 22

Rong-Hua Li and Jeffrey Xu Yu

Abstract: Enhancing diversity in ranking on graphs has been identified as an important retrieval and mining task. Nevertheless, many existing diversified ranking algorithms either cannot scale to large graphs because of their time or memory requirements, or lack an intuitive and reasonable diversified ranking measure. In this paper, we propose a new diversified ranking measure on large graphs, which captures both relevance and diversity, and formulate the diversified ranking problem as a submodular set function maximization problem. Based on the submodularity of the proposed measure, we develop an efficient greedy algorithm with linear time and space complexity w.r.t. the size of the graph to achieve near-optimal diversified ranking. In addition, we present a generalized diversified ranking measure and give a near-optimal randomized greedy algorithm with linear time and space complexity for optimizing it. We evaluate the proposed methods through extensive experiments on five real datasets. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.

Index Terms: Diversified Ranking, Graph Algorithms, Scalability, Flajolet-Martin sketch, Submodular Function.

1 INTRODUCTION

Ranking nodes on graphs is a fundamental task in information retrieval, data mining, and social network analysis. It has a large number of applications, such as ranking web pages [1], measuring centrality in social networks [2], and enhancing personalized services for web search [3]. Most existing graph-based ranking algorithms are based on the stationary distribution of a random walk on the graph, such as the PageRank algorithm [1] and its variants [3][4]. The idea behind these random-walk-based ranking algorithms is that a node should be ranked higher if more high-ranking nodes link to it.
This basic idea has become a crucial criterion for designing ranking algorithms on graphs and has been successfully applied in many applications. However, as discussed in [5][6], this design criterion causes many of the nodes in the top-K ranking list to be similar, because it only considers the relevance of the nodes. This reduces ranking effectiveness when applications need to incorporate diversity into the top-K ranking results.

Take Flickr, a well-known photo-sharing website, as an example. Users in Flickr can make friends and join many interest groups. Consider a retrieval task of finding the top-K relevant users who are similar to a given user but come from as many interest groups as possible in the Flickr social network. In general, we can use personalized PageRank algorithms [1][3][4] to rank the users, and then find the top-K users based on their personalized PageRank scores. However, the top-K users found by personalized PageRank typically include many users who are in the same interest group, and thus cannot meet our objective of diversity. To this end, we need to take the diversity of the top-K ranking list into account when designing ranking algorithms. In other words, the ranking algorithms in this case should produce diversified ranking results so as to cover as many groups as possible.

Recently, improving diversity in top-K ranking results has attracted much attention, as it has a variety of applications in information retrieval and data mining. There exists a large body of work on search result diversification for both text and graph datasets. In this paper, we focus on enhancing diversity in ranking on graph datasets. We are interested in finding the top-K nodes that are not only relevant to the query but also dissimilar to one another. Here the relevance of the nodes is measured by their personalized PageRank scores. In the literature, there are four frameworks for diversified ranking on graphs.

Author affiliation: The Chinese University of Hong Kong, {rhli,yu}@se.cuhk.edu.hk
The first is based on a greedy vertex selection procedure [5][7], the second is based on a so-called vertex-reinforced random walk [6], the third is based on optimizing predefined diversified measures [8][9], and the last is based on resistive graph centers []. In particular, the greedy vertex selection procedure chooses the vertex with the maximum random-walk-based ranking score at each step, and then removes the selected vertex from the graph. To get the top-K ranking list, this process repeats K times. To the best of our knowledge, there are two algorithms based on this framework: the Grasshopper algorithm [5] and the manifold rank with stop points algorithm [7]. Both algorithms have been shown empirically to improve diversity in ranking on graph data. However, the major drawback of this type of algorithm is its cubic time complexity, so it cannot scale to large graphs. Another drawback of this type of algorithm is the lack of a theoretical explanation

for why they can improve diversity in ranking results. Some improvement on this point has been achieved in the second framework [6]. In [6], Mei et al. propose a diversified ranking algorithm, called DivRank, based on a vertex-reinforced random walk, and present an optimization explanation of why DivRank improves diversity in ranking. However, the explanation is only suitable for undirected graphs. In addition, the convergence property of DivRank is not clear, because it resorts to approximation strategies for the original vertex-reinforced random walks. Another drawback of DivRank is that it cannot scale to large graphs, for two reasons. On the one hand, DivRank dynamically updates the transition matrix at each iteration. This procedure may result in a full transition matrix, which cannot be stored in main memory if the graph is very large. On the other hand, the full transition matrix increases the computational cost of the matrix-vector multiplication.

Tong et al. [8] propose a scalable diversified ranking algorithm that optimizes a predefined diversified ranking measure. However, the motivation of their measure is not explicitly clarified. Specifically, for measuring diversity, their measure is based on a multiplication of the so-called Google matrix and the personalized PageRank vector, which lacks a clear topological explanation. Hence, it does not directly reflect the diversity of a set of nodes from a graph-structural perspective. The last notable diversified ranking algorithm is based on resistive graph centers []. Similar to the greedy vertex selection algorithms, the time complexity of this algorithm is cubic, so it cannot scale to large graphs.

To overcome the problems in the existing algorithms, in this paper, we present a novel diversified ranking method on graphs.
The basic idea of our approach is that we first calculate the personalized PageRank vector on the basis of the query node, and then perform a carefully designed vertex selection algorithm to find the top-K diversified ranking list according to a predefined diversified ranking measure. The key challenges in our method are (1) how to define an intuitive and reasonable diversified ranking measure that captures both relevance and diversity, and (2) how to develop an efficient vertex selection algorithm to optimize the diversified ranking measure.

To this end, firstly, we propose a modified definition of expansion on graphs to capture the diversity of the nodes. The key intuition is that if a set of nodes has large expansion, then the nodes will be dissimilar to each other, thus leading to diversity. Secondly, based on this definition, we propose a novel diversified ranking measure by combining relevance and diversity. We show that the proposed measure is a nondecreasing submodular set function. Based on the submodularity of the proposed measure, we design an efficient greedy algorithm with linear time and space complexity w.r.t. the size of the graph to find the top-K diversified ranking list. Thirdly, we further present a generalized diversified ranking measure based on the definition of k-step expansion, and propose a randomized greedy algorithm with linear time and space complexity to optimize it accurately. Finally, we compare our proposed methods with six existing algorithms on five real networks. The experimental results demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms. A preliminary study of this work is reported in [9].

The rest of this paper is organized as follows. We briefly review the personalized PageRank algorithm and present our new diversified ranking measure as well as our problem formulation in Section 2. We show the submodularity of the proposed measure and give a near-optimal greedy algorithm for finding the top-K diversified ranking in Section 3.
We present a generalized diversified ranking measure and a randomized greedy algorithm in Section 4. Extensive experiments are reported in Section 5, and related work is discussed in Section 6. We conclude this work in Section 7.

2 PRELIMINARIES

In this section, we first briefly review the personalized PageRank algorithm, which is used as a basic measure of relevance in diversified ranking on graphs. Then, we propose a new diversified ranking measure and formulate our diversified ranking problem as a discrete optimization problem.

2.1 Personalized PageRank algorithm

Personalized PageRank [1][3][4] is a well-known approach for query-dependent ranking on graphs, and it has been successfully used in various applications in the past decades. We briefly describe the personalized PageRank algorithm below. Given a query vector $r$ (also called the teleport vector in much of the literature []) and a graph $G$, the personalized PageRank vector $w$ can be calculated by the following iterative equation:

$$w = (1 - \alpha)\, r + \alpha A^{T} w, \qquad (1)$$

where $\alpha$ is a damping factor and $A$ is the adjacency matrix of graph $G$. The iteration in Eq. (1) converges to a fixed point, which corresponds to the stationary distribution of the Markov chain. The resulting vector $w$ is used to rank the nodes of the graph.

However, personalized PageRank does not consider the diversity of the ranking results. This is because personalized PageRank uses the stationary distribution of the random walks for ranking nodes in the graph. The random walk on a graph forms a Markov chain. By the fundamental theorem of Markov chains [], the stationary distribution of the walks is inversely proportional to the hitting time. If a

node is hit very frequently by random walks, then the node will have a high personalized PageRank score. Also, if a node is hit frequently, its neighbors are likely to be hit frequently as well, so its neighbors also get high personalized PageRank scores. As a result, many adjacent nodes end up together in the top-K ranking results. In other words, the top-K ranking list found by personalized PageRank may contain many similar nodes, which reduces ranking effectiveness in applications that need to incorporate diversity.

2.2 Problem formulation

In the literature, there are many ranking algorithms on graphs [5][6][7] that aim at improving diversity. However, as analyzed in the introduction, the existing diversified ranking algorithms either cannot scale to large graphs or lack an intuitive and reasonable diversified ranking measure. To this end, in this paper, we propose a new diversified ranking measure on graphs and design a scalable algorithm for optimizing it accurately. Below, we first give some important notations and definitions, and then formulate our diversified ranking problem.

Notations and definitions: Consider a graph $G = (V, E)$, with a set of nodes $V$ and a set of edges $E$, where the number of nodes is $n = |V|$.

Definition 2.1: Let $S$ be a set of nodes. The expanded set of $S$ is denoted by $N(S)$, where $N(S) = S \cup \{v \in (V \setminus S) \mid \exists u \in S, (u, v) \in E\}$. The expansion of a set of nodes $S$ is the size of the expanded set $N(S)$, denoted as $|N(S)|$. The expansion ratio is defined as $\sigma = |N(S)| / n$.

It is worth mentioning that our definition of expansion is based on the topological structure of the graph, which can be either undirected or directed. In addition, it is important to note that our definition of expansion is different from the definition of expansion used for expander graphs [2], where the expansion of a graph equals the minimum expansion ratio among all the expanded sets. With Def. 2.1, a set of nodes with a large expansion ratio implies that the nodes are dissimilar to one another.
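The two ingredients introduced so far, the iteration of Eq. (1) and the expanded set of Definition 2.1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is an adjacency list `out_neighbors` (a hypothetical name), the matrix $A$ is taken to be row-normalized so that the iteration preserves probability mass, and the default `alpha=0.85` and fixed iteration count are illustrative choices.

```python
def personalized_pagerank(out_neighbors, r, alpha=0.85, iters=100):
    """Iterate w = (1 - alpha) * r + alpha * A^T w (Eq. (1)).

    Assumption: A is the row-normalized adjacency matrix, so each node
    splits its score evenly among its out-neighbors.
    """
    n = len(out_neighbors)
    w = list(r)
    for _ in range(iters):
        nxt = [(1.0 - alpha) * r[i] for i in range(n)]
        for u, nbrs in enumerate(out_neighbors):
            if not nbrs:
                continue  # dangling node: its mass is dropped in this sketch
            share = alpha * w[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        w = nxt
    return w


def expanded_set(out_neighbors, S):
    """N(S): the nodes of S plus every node reachable from S by one edge."""
    result = set(S)
    for u in S:
        result.update(out_neighbors[u])
    return result


def expansion_ratio(out_neighbors, S):
    """sigma = |N(S)| / n from Definition 2.1."""
    return len(expanded_set(out_neighbors, S)) / len(out_neighbors)
```

On a 5-node undirected path `[[1], [0, 2], [1, 3], [2, 4], [3]]`, the set `{2}` has expansion ratio 0.6 and the two end nodes `{0, 4}` have expansion ratio 0.8.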
Here, the intuition is that two nodes are dissimilar if they do not share common neighbors in the graph. The larger the expansion ratio of a set of nodes, the better the diversity among those nodes. Consider the graph in Fig. 1(a). Assume we select three nodes (red nodes) in Fig. 1(b) and Fig. 1(c), respectively. Then, the expansion ratios of the selected nodes in Fig. 1(b) and Fig. 1(c) are 0.6 and 0.9, respectively. The selected nodes in Fig. 1(b) are well connected, so they can be similar to one another. On the other hand, there is no edge between any two selected nodes in Fig. 1(c), so they can be dissimilar to each other. As a result, the selected nodes in Fig. 1(c) are more diverse than the selected nodes in Fig. 1(b). This example indicates that nodes with a larger expansion ratio result in better diversity. Our diversified ranking measure is based on this key intuition.

Fig. 1. Illustration of our idea: expansion ratio vs. diversity. (a) A graph $G$; (b) $\sigma = 0.6$; (c) $\sigma = 0.9$. Red square nodes denote the selected nodes and green nodes are the expanded nodes (color online).

Diversified ranking measure: The most commonly used criterion for combining relevance and diversity is the so-called maximum marginal relevance (MMR) [3], which is a linear combination of relevance and diversity and is widely used in many document retrieval systems. With MMR, a document with a high marginal relevance is relevant to the query and dissimilar to the previously selected documents. Similarly, in a graph, a node with a high diversified rank should (1) have a high personalized PageRank score, and (2) be dissimilar to the other selected nodes. Our definition of expansion ratio can be regarded as a diversity measure. We aim at finding a subset $S$ of $K$ nodes such that (1) the nodes in $S$ have high personalized PageRank scores and (2) the expansion ratio $|N(S)| / n$ is maximum.
Formally, our goal is to maximize the following diversified ranking measure:

$$F(S) = (1 - \lambda) \sum_{u \in S} w_u + \lambda \frac{|N(S)|}{n}, \qquad (2)$$

where $w_u$ denotes the personalized PageRank score of node $u$, and $\lambda \in [0, 1]$ is a parameter that trades off relevance against diversity. The first term in Eq. (2) is the sum of the personalized PageRank scores over the ranking results, which reflects the relevance of the ranking results. The second term is the expansion ratio of the ranking results. As discussed, a better expansion ratio implies better diversity. Hence, Eq. (2) captures both relevance and diversity.

Note that $F(S)$ does not consider the ordering of the top-K ranking list. This is because our definition is based on the mild assumption that the users of a real retrieval system generally focus on all the top-K results. This assumption is typically reasonable in many practical applications [5][6][7]. However, in Section 3.3, we will show that our proposed algorithm still yields an ordered result based on both the relevance and the diversity score of each node. To summarize, our problem of finding the top-K diversified ranking on a graph is formalized as follows:

$$\arg\max_{S \subseteq V} F(S) \quad \text{s.t.} \quad |S| = K. \qquad (3)$$
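To make Eq. (2) and Eq. (3) concrete, the following Python sketch evaluates $F(S)$ and solves the constrained problem by exhaustive search. It is only an illustration on toy inputs: the function names are hypothetical, the adjacency-list representation is an assumption, and the exhaustive search is exponential in $K$ and is not the method proposed in this paper.

```python
from itertools import combinations


def diversified_score(adj, w, S, lam):
    """F(S) = (1 - lam) * sum_{u in S} w_u + lam * |N(S)| / n (Eq. (2))."""
    covered = set(S)
    for u in S:
        covered.update(adj[u])  # add the one-hop neighbors of u
    relevance = sum(w[u] for u in S)
    return (1.0 - lam) * relevance + lam * len(covered) / len(adj)


def brute_force_top_k(adj, w, K, lam):
    """Solve Eq. (3) by trying every K-subset; for toy graphs only."""
    return max(combinations(range(len(adj)), K),
               key=lambda S: diversified_score(adj, w, S, lam))
```

For the 5-node path `adj = [[1], [0, 2], [1, 3], [2, 4], [3]]` with scores `w = [0.4, 0.3, 0.15, 0.1, 0.05]` and `K = 2`, setting `lam = 0` returns the two most relevant nodes `(0, 1)`, while `lam = 0.5` returns `(0, 3)`, whose expanded set covers the whole path.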

3 DIVERSIFIED RANKING ALGORITHM

As discussed, our diversified ranking problem is to maximize the proposed diversified ranking measure subject to a cardinality constraint (Eq. (3)). The following theorem shows that the problem formulated in Eq. (3) is NP-hard on general graphs.

Theorem 3.1: For a general graph $G = (V, E)$, the optimization problem in Eq. (3) is NP-hard.

Proof Sketch: We consider a special case of the problem defined in Eq. (3) and show that it is NP-hard. Let $\lambda = 1$; then the problem is to maximize $|N(S)|$ subject to $|S| = K$. This special case is equivalent to the maximal expansion problem defined in [4], which is known to be NP-hard. As a consequence, our problem defined in Eq. (3) is also NP-hard.

Given the hardness of our problem, there is no hope of solving the top-K diversified ranking problem optimally on general graphs in polynomial time unless P=NP. Only on trees can the diversified ranking problem (Eq. (3)) be solved optimally in polynomial time, by a dynamic programming algorithm that we describe in the following subsection.

3.1 Diversified ranking on trees

Although the diversified ranking problem on general graphs is NP-hard, we show that it can be solved optimally in polynomial time when the graph is a tree. Our polynomial-time algorithm is based on dynamic programming. The basic idea is as follows. Consider a subtree whose root has $x$ children. The optimal way of finding $K$ nodes from the subtree for the diversified ranking list must follow one of two cases. In the first case, we include the root of the subtree in the ranking list and then recurse on the children with a budget of $K-1$. In the second case, we do not add the root of the subtree, and instead recurse on the children with a budget of $K$. A naive implementation of this recursion needs to partition the budget of $K$ (or $K-1$) among the $x$ children in all possible ways. Obviously, this is extremely expensive when $x > 2$. To overcome this, we construct a transformation that converts the general tree to a binary tree without altering the optimum.
The transformation is described as follows. We start from the root of tree $T$, denoted by $root(T)$. Assume $u$ is an internal node of $T$ with children $u_1, u_2, \ldots, u_x$ and $x > 2$. Then, we replace $u$ by a binary tree with depth at most $\log_2 x$ and leaves $u_1, u_2, \ldots, u_x$. In particular, let $u_1$ be the left child of $u$. Add a new node $z$ and let it be the right child of $u$. Then, let the remaining children of $u$ be the children of $z$. Repeat these steps until every node has at most two children. We set the personalized PageRank score of the newly added nodes to $-\infty$, and the personalized PageRank scores of $u, u_1, u_2, \ldots, u_x$ are the same as before. This ensures that the newly added nodes will never be added to the top-K ranking list. Obviously, the depth of the new (binary) tree is at most a factor of $\log_2 d_{max}$ larger than the depth of the original tree. Here $d_{max}$ denotes the maximum out-degree of a node in the original tree. Further, the size of the binary tree is at most twice the size of the original tree. More importantly, it is not hard to verify that the optimal solution of Eq. (3) on the binary tree is the same as the optimal solution on the original tree. Similar constructions have been used for various applications [5][6].

Based on this construction, we can assume the tree is binary, and we denote it by $T$. For each node $u$ in $T$, we define a cost function w.r.t. the current solution $S$ as $C(u, S) = (1 - \lambda) w_u + \lambda\, |N(\{u\}) \setminus N(S)| / n$. Let $F(u, S, k)$ be the optimal solution in the subtree rooted at $u$ with budget $k$, where the set $S$ maintains the current solution, and let $l(u)$ ($r(u)$) denote the left (right) child of node $u$. Then, the recursive equation of the dynamic programming (DP) is given by

$$F(u, S, k) = \max\Big\{ \max_{0 \le i \le k} \{F(l(u), S, i) + F(r(u), S, k - i)\},\; C(u, S) + \max_{0 \le i \le k-1} \{F(l(u), S \cup \{u\}, i) + F(r(u), S \cup \{u\}, k - 1 - i)\} \Big\}.$$

The first term of the recursive equation corresponds to not selecting $u$ for $S$, and the second term corresponds to adding $u$ to $S$, which leaves a budget of $k - 1$ for its children. We analyze the time complexity of the DP algorithm as follows (here we use a budget of $K$).
First, building the binary tree takes $O(n \log_2 d_{max})$ time. Second, we need to evaluate the recursion $O(K)$ times for each node in the binary tree. Each such evaluation takes $O(K)$ time. Notice that computing $C(u, S)$ can be done in constant time in a binary tree. There are $O(n \log_2 d_{max})$ nodes in the binary tree. Putting it all together, the time complexity of the DP algorithm is $O(K^2 n \log_2 d_{max})$.

3.2 Submodularity

Since the diversified ranking problem on general graphs is NP-hard, we resort to developing approximation algorithms for solving it efficiently. Below, we prove that our proposed diversified ranking measure $F(S)$ is a nondecreasing submodular set function, which allows us to develop a near-optimal greedy algorithm for maximizing it efficiently. We give the definition of a nondecreasing submodular set function [7] as follows.

Definition 3.1: Let $V$ be a finite set. A real-valued function $f(S)$ on the set of subsets $S$ of $V$ is called a nondecreasing submodular set function if the following conditions hold.

Nondecreasing: For any subsets $S$ and $T$ of $V$ such that $S \subseteq T \subseteq V$, we have $f(S) \le f(T)$.

Submodularity: Let $\rho_j(S) = f(S \cup \{j\}) - f(S)$ be the marginal gain. Then, for any subsets $S$ and $T$ of $V$ such that $S \subseteq T \subseteq V$ and $j \in V \setminus T$, we have $\rho_j(S) \ge \rho_j(T)$.
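The two conditions of Definition 3.1 can be checked by brute force on a toy graph. The sketch below does this for the measure of Eq. (2); it assumes nonnegative scores `w`, an adjacency-list graph, and hypothetical function names, and it enumerates all pairs $S \subseteq T$, so it is usable only for a handful of nodes. It is a sanity check, not a proof.

```python
from itertools import combinations


def F(adj, w, S, lam):
    """Diversified ranking measure of Eq. (2) on an adjacency-list graph."""
    covered = set(S)
    for u in S:
        covered.update(adj[u])
    return (1.0 - lam) * sum(w[u] for u in S) + lam * len(covered) / len(adj)


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_nondecreasing_submodular(adj, w, lam, tol=1e-12):
    """Check both conditions of Definition 3.1 for all S subset of T."""
    V = frozenset(range(len(adj)))
    for T in subsets(V):
        for S in subsets(T):
            if F(adj, w, S, lam) > F(adj, w, T, lam) + tol:
                return False  # violates f(S) <= f(T)
            for j in V - T:
                gain_S = F(adj, w, S | {j}, lam) - F(adj, w, S, lam)
                gain_T = F(adj, w, T | {j}, lam) - F(adj, w, T, lam)
                if gain_S + tol < gain_T:
                    return False  # violates rho_j(S) >= rho_j(T)
    return True
```

The tolerance `tol` only absorbs floating-point rounding in the differences of sums.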

We prove that Eq. (2) is a nondecreasing submodular function with $F(\emptyset) = 0$, where $\emptyset$ is the empty set. We state the theorem as follows.

Theorem 3.2: The set function $F(S)$ defined in Eq. (2) is a nondecreasing submodular function with $F(\emptyset) = 0$.

Proof: For $S \subseteq T \subseteq V$ and $j \in V \setminus T$, let $\rho_j(S) = F(S \cup \{j\}) - F(S)$ and $\rho_j(T) = F(T \cup \{j\}) - F(T)$. Then, we have

$$\rho_j(T) = (1 - \lambda) w_j + \lambda \frac{|N(T \cup \{j\})| - |N(T)|}{n} = (1 - \lambda) w_j + \lambda \frac{|N(\{j\}) \setminus N(T)|}{n}. \qquad (4)$$

Note that the nondecreasing property of $F(S)$ is guaranteed by $\rho_j(T) \ge 0$. Similarly, we have $\rho_j(S) = (1 - \lambda) w_j + \lambda\, |N(\{j\}) \setminus N(S)| / n$. By definition, we have $F(\emptyset) = 0$, and since $N(S) \subseteq N(T)$, we have $|N(\{j\}) \setminus N(S)| \ge |N(\{j\}) \setminus N(T)|$. Hence, we conclude $\rho_j(S) \ge \rho_j(T)$. This completes the proof.

3.3 The greedy algorithm

Because our diversified ranking measure is submodular, by the result in [7] we can develop an efficient greedy algorithm with a $(1 - 1/e)$ approximation guarantee for our top-K diversified ranking problem. Alg. 1 outlines our greedy algorithm. The algorithm first computes the personalized PageRank vector as the initial ranking (line 1), which measures the relevance of the nodes. Then, in each iteration, the algorithm chooses a node $u$ with the maximum marginal gain $\rho_u(S) = (1 - \lambda) w_u + \lambda\, |N(\{u\}) \setminus N(S)| / n$ (lines 7-15), and adds it to the answer set $S$. To get the top-K ranking list, this procedure repeats $K$ times (lines 4-17). The algorithm produces an ordered ranking list according to $\rho_u(S)$. Since $\rho_u(S)$ satisfies the nondecreasing property, Alg. 1 outputs a reasonable ranking in which nodes with a high ranking score appear at the top of the list. Theoretically, the following theorem shows that Alg. 1 obtains a near-optimal solution.

Theorem 3.3: Alg. 1 is a $(1 - 1/e)$ approximation algorithm for the top-K diversified ranking problem (Eq. (3)).

Proof Sketch: This can be proved by an argument similar to the one used to prove the approximation factor of the greedy algorithm for the submodular set function maximization problem [7].
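The greedy selection described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it takes a precomputed score vector `w` instead of running personalized PageRank, uses an adjacency list, and counts the candidate node itself as newly covered when it is not yet in $N(S)$, which follows Definition 2.1. All names are hypothetical.

```python
def greedy_diversified_ranking(adj, w, K, lam):
    """Pick K nodes, each time taking the largest marginal gain

        rho_u(S) = (1 - lam) * w_u + lam * |N({u}) \\ N(S)| / n.
    """
    n = len(adj)
    covered = [False] * n   # indicator array for membership in N(S)
    chosen = [False] * n    # indicator array for membership in S
    selected = []
    for _ in range(K):
        best_gain, best = -1.0, -1
        for u in range(n):
            if chosen[u]:
                continue
            # Nodes of N({u}) that are not yet in N(S).
            new_nodes = {v for v in adj[u] if not covered[v]}
            if not covered[u]:
                new_nodes.add(u)
            gain = (1.0 - lam) * w[u] + lam * len(new_nodes) / n
            if gain > best_gain:
                best_gain, best = gain, u
        selected.append(best)
        chosen[best] = True
        covered[best] = True
        for v in adj[best]:
            covered[v] = True
    return selected
```

On the 5-node path `adj = [[1], [0, 2], [1, 3], [2, 4], [3]]` with `w = [0.4, 0.3, 0.15, 0.1, 0.05]`, `K = 2`, and `lam = 0.5`, the sketch returns `[1, 3]` with $F = 0.70$, whereas the best pair on this toy graph is $\{0, 3\}$ with $F = 0.75$; the greedy result is therefore near-optimal but not optimal here, which is consistent with the $(1 - 1/e)$ guarantee.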
It is worth mentioning that the $(1 - 1/e)$ approximation factor is tight [8]. In other words, no polynomial-time algorithm can achieve a better approximation factor unless P=NP. Below, we analyze the time and space complexity of Alg. 1.

Algorithm 1 The Greedy Algorithm
Input: Graph $G = (V, E)$, $K$, damping factor $\alpha$, adjacency matrix $A$, teleport vector $r$, and parameter $\lambda$
Output: A set $S$ with $K$ nodes
 1: Compute the personalized PageRank vector $w$;
 2: Initialize the answer set $S \leftarrow \emptyset$;
 3: For each node $v_i$, initialize an indicator array Expan[i] $\leftarrow 0$;
 4: for iter = 1 to $K$ do
 5:     max $\leftarrow -1$;
 6:     maxIdx $\leftarrow -1$;
 7:     for each node $v_i \in (V \setminus S)$ do
 8:         counter $\leftarrow 0$;
 9:         for each neighbor node ($v_j$) of $v_i$ do
10:             if Expan[j] = 0 then
11:                 counter $\leftarrow$ counter + 1;
12:         if $((1 - \lambda) w_i + \lambda \cdot \text{counter} / |V|) >$ max then
13:             max $\leftarrow (1 - \lambda) w_i + \lambda \cdot \text{counter} / |V|$;
14:             maxIdx $\leftarrow i$;
15:     $S \leftarrow S \cup \{v_{maxIdx}\}$;
16:     for each neighbor node ($v_j$) of $v_{maxIdx}$ do
17:         Expan[j] $\leftarrow 1$;
18: return $S$;

Complexity analysis of the greedy algorithm: The time complexity of Alg. 1 is $O(K|E|)$. Specifically, in line 1, Alg. 1 takes $O(|E|)$ time to compute the personalized PageRank vector. The time complexity from line 4 to line 17 is $O(K|E|)$. This is because, in each of the $K$ iterations, the algorithm needs to visit all the nodes and their corresponding neighbors, and the total number of nodes visited equals $2|E|$ in the worst case. Moreover, we can use the so-called CELF framework to accelerate Alg. 1, which results in a speedup of several times [9]. For the space complexity, Alg. 1 needs to store the input graph $G$, the personalized PageRank vector $w$, the answer set $S$, and an indicator array, which leads to $O(|V| + |E|)$ in total. Putting it all together, the algorithm has linear time and space complexity w.r.t. the graph size, and thus it can scale to large graphs.

3.4 Connection to dominating set problem

The proposed top-K diversified ranking problem (Eq. (3)) is closely connected to the dominating set problem in graph theory [2]. The minimum dominating set problem aims to find the minimum number of nodes whose expanded set covers the whole graph.
In other words, the nodes in the minimum dominating set dominate the other nodes of the graph. The domination number (DN) of a graph is the cardinality of the minimum dominating set. It is well known that the minimum dominating set problem is NP-hard. There is an efficient greedy algorithm with a $1 + \ln(|V|)$ approximation factor to compute the DN and the dominating set of a graph [2]. Specifically, the greedy algorithm chooses a node with the maximal marginal gain ($\rho_u(S) = |N(S \cup \{u\})| - |N(S)|$) at each step, and it terminates when the expanded set of the selected nodes covers the whole graph. Note that the minimum dominating set problem only considers the expansion of the nodes

and ignores the relevance of the nodes, so it cannot be directly applied to our problem. Moreover, our top-K diversified ranking problem aims to find $K$ nodes that are relevant to the query and simultaneously dissimilar to one another; it does not aim to find the minimum number of nodes whose expanded set covers the whole graph. In the case that $K$ exceeds the domination number (DN) of the graph, Alg. 1 will choose nodes in terms of their personalized PageRank scores. However, in many real graphs, $K$ is significantly smaller than the DN of the graph. We will address this point in our experimental studies in Section 5.

4 GENERALIZED DIVERSIFIED RANKING

In this section, we first propose a generalized diversified ranking measure and design an efficient greedy algorithm to optimize it accurately. Then, we discuss other potential variants of our diversified ranking measures.

4.1 Generalized diversified ranking measure

The proposed diversified ranking measure $F(S)$, which is built on the expansion of Def. 2.1, only considers the immediate neighborhood information of $S$. Naturally, we can generalize the diversified ranking measure $F(S)$ by taking the k-step nearest neighbors into account. We call such a measure a generalized diversified ranking measure and denote it by $F_k(S)$. In the following, we first give the definitions of k-step expanded set and k-step expansion.

Definition 4.1: Let $S$ be a set of nodes. The k-step expanded set of $S$ is denoted by $N_k(S)$, where $N_k(S) = S \cup \{v \in (V \setminus S) \mid \exists u \in S, d(u, v) \le k\}$ and $d(u, v)$ denotes the length of the shortest path from $u$ to $v$. The k-step expansion of $S$ is the cardinality of the k-step expanded set, denoted as $|N_k(S)|$. The k-step expansion ratio is defined as $\sigma_k = |N_k(S)| / n$.

Based on the k-step expansion, we define the generalized diversified ranking measure $F_k(S)$ as follows:

$$F_k(S) = (1 - \lambda) \sum_{u \in S} w_u + \lambda \frac{|N_k(S)|}{n}. \qquad (5)$$

Obviously, $F(S)$ is a special case of $F_k(S)$ when $k = 1$. Like $F(S)$, $F_k(S)$ is also a nondecreasing submodular function. We give a theorem as follows.
The proof is similar to the proof of Theorem 3.2, so we omit it for brevity.

Theorem 4.1: The set function $F_k(S)$ defined in Eq. (5) is a nondecreasing submodular function with $F_k(\emptyset) = 0$, where $\emptyset$ denotes the empty set.

Likewise, the problem of maximizing the set function $F_k(S)$ subject to a cardinality constraint is NP-hard. However, based on the submodularity property of $F_k(S)$, we can develop a greedy algorithm to optimize it accurately. (Here, we use the lowercase letter $k$ to distinguish it from $K$, which denotes the cardinality of our top-K ranking results.) Now, the problem is that the greedy algorithm needs to find a node with the maximum marginal gain $\rho_u(S) = (1 - \lambda) w_u + \lambda\, |N_k(\{u\}) \setminus N_k(S)| / n$ in each iteration. Unlike in Alg. 1, the marginal gain $\rho_u(S)$ cannot be calculated in linear time when $k > 1$.

A naive implementation of maximizing $F_k(S)$ is as follows. First, we construct a new graph in which any two nodes $u$ and $v$ have an edge $(u, v)$ if $u$ can reach $v$ within $k$ ($k > 1$) hops in the original graph. Then, we perform Alg. 1 on the new graph. The construction of the new graph can be implemented by the Floyd algorithm [2], resulting in $O(|V|^3)$ time complexity. Performing Alg. 1 on the new graph takes $O(K|E'|)$ time, where $E'$ denotes the set of edges in the new graph. Hence, the time complexity of this naive algorithm is $O(|V|^3)$, which is clearly not scalable. In the following, we develop a randomized greedy algorithm with linear time complexity using the Flajolet-Martin (FM) sketch [22].

4.2 The randomized greedy algorithm

Recall that the major time-consuming step in optimizing the generalized diversified ranking measure (Eq. (5)) is to evaluate the marginal gain $\rho_u(S) = (1 - \lambda) w_u + \lambda\, (|N_k(S \cup \{u\})| - |N_k(S)|) / n$. Inspired by the idea of the approximate neighborhood function [23], we propose a randomized greedy algorithm for the generalized diversified ranking problem using the FM sketch. The FM sketch is a probabilistic counting structure, which can be used to estimate the number of distinct elements (cardinality) in a multi-set [22].
Assume the cardinality of a multi-set $A$ is $C$. Then the FM sketch uses only $\log C + t$ bits to estimate $C$ with high accuracy, where $t$ is a small constant. More specifically, the FM sketch is a bitmap of size $s = \log C + t$. There is a hash function $h : A \to \{0, \ldots, s\}$, which maps an element $a$ in $A$ to a bit $i \in \{0, \ldots, s\}$ in the bitmap with probability $\Pr(h(a) = i) = 1/2^{i+1}$. Initially, all bits in the bitmap are set to 0. Then, each element $a \in A$ is inserted into the bitmap by setting the corresponding $h(a)$-th bit to 1. Finally, an asymptotically unbiased estimate of the cardinality $C$ is obtained by $2^c / 0.77351$, where $c$ denotes the position of the least-significant zero bit in the bitmap. We can use multiple hash functions to boost the estimation accuracy. For the sake of brevity, we only consider one hash function to illustrate the algorithm.

In addition, an important property of the FM sketch is that it can easily be applied to estimate the cardinality of the union of two multi-sets if these two multi-sets come from the same domain. In particular, we can construct an FM sketch with the same size for each multi-set. To estimate the cardinality of the union of two multi-sets, we only need to do a bitwise-or between the two FM sketches, and then estimate the cardinality based on the resulting FM sketch.
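The FM sketch described above can be illustrated with a short Python sketch. Several details are assumptions made for the example and are not taken from the paper: the geometric hash is derived from SHA-256 by taking the index of the lowest set bit, several seeded hash functions are combined by averaging the position $c$ before exponentiating, and all function names are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant of the FM estimator


def fm_bit(x, seed, size=32):
    """Map x to bit i with probability about 1 / 2^(i+1) (capped at size - 1)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") | (1 << (size - 1))
    return (value & -value).bit_length() - 1  # index of the lowest set bit


def fm_sketch(items, seeds, size=32):
    """Build one bitmap per seed; inserting a duplicate changes nothing."""
    bitmaps = [0] * len(seeds)
    for x in items:
        for k, seed in enumerate(seeds):
            bitmaps[k] |= 1 << fm_bit(x, seed, size)
    return bitmaps


def fm_union(a, b):
    """The sketch of a union is the bitwise-or of the two sketches."""
    return [x | y for x, y in zip(a, b)]


def lowest_zero(bitmap):
    """Position c of the least-significant zero bit."""
    c = 0
    while (bitmap >> c) & 1:
        c += 1
    return c


def fm_estimate(bitmaps):
    """Estimate the cardinality as 2^c / PHI, with c averaged over bitmaps."""
    mean_c = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_c / PHI
```

For example, with `seeds = range(64)`, the call `fm_union(fm_sketch(range(0, 600), seeds), fm_sketch(range(400, 1000), seeds))` produces exactly the same bitmaps as `fm_sketch(range(1000), seeds)`, which is the union property that the randomized greedy algorithm relies on.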

Algorithm 2 The Randomized Greedy Algorithm
Input: Graph $G = (V, E)$, $K$, damping factor $\alpha$, adjacency matrix $A$, teleport vector $r$, parameter $k$ of the k-step expansion, parameter $\lambda$
Output: A set $S$ with $K$ nodes
 1: Compute the personalized PageRank vector $w$;
 2: Let $h : \{v_1, \ldots, v_n\} \to \{0, \ldots, s\}$ be the hash function that maps the nodes to a position of the BITMAP, where $s$ is the size of the BITMAP;
 3: for each node $v_i \in V$ do
 4:     Initialize a BITMAP FM[i] $\leftarrow 0$;
 5:     Set the $h(v_i)$-bit of FM[i] to 1;
 6:     Initialize a temporary BITMAP TFM[i] $\leftarrow 0$;
 7: for iter = 1 : $k$ do
 8:     for each node $v_i \in V$ do
 9:         TFM[i] $\leftarrow$ FM[i];
10:     for each edge $(v_i, v_j) \in E$ do
11:         FM[i] = (FM[i]) BITWISE-OR (TFM[j]);
12: Initialize the answer set $S \leftarrow \emptyset$;
13: Initialize two BITMAPs NBP $\leftarrow 0$, OBP $\leftarrow 0$;
14: $c \leftarrow 0$;
15: for iter = 1 to $K$ do
16:     max $\leftarrow -1$;
17:     maxIdx $\leftarrow -1$;
18:     for each node $v_i \in (V \setminus S)$ do
19:         OBP $\leftarrow$ (NBP) BITWISE-OR (FM[i]);
20:         Let $t$ be the position of the rightmost 0 bit in the BITMAP OBP;
21:         counter $\leftarrow 2^t / 0.77351$;
22:         counter $\leftarrow$ counter $- c$;
23:         if $(1 - \lambda) w_i + \lambda \cdot \text{counter} / |V| >$ max then
24:             max $\leftarrow (1 - \lambda) w_i + \lambda \cdot \text{counter} / |V|$;
25:             maxIdx $\leftarrow i$;
26:     $S \leftarrow S \cup \{v_{maxIdx}\}$;
27:     NBP $\leftarrow$ (NBP) BITWISE-OR (FM[maxIdx]);
28:     Let $t$ be the position of the rightmost 0 bit in the BITMAP NBP;
29:     $c \leftarrow 2^t / 0.77351$;
30: return $S$;

It is worth mentioning that there also exist many other probabilistic counting structures, such as the Loglog sketch [24] and the Hyper Loglog sketch [25], but the union of these sketches cannot be easily implemented by bitwise-or. Therefore, in our problem, we apply the FM sketch to estimate the size of the k-step expanded set, i.e., $|N_k(S)|$. The main idea of our algorithm is that we construct an FM sketch to estimate the k-step expansion $|N_k(\{v\})|$ of each node $v$. To estimate the k-step expansion of a set $S$, $|N_k(S)|$, we only need to do $|S|$ bitwise-or operations over the FM sketches of the nodes in $S$. We depict our algorithm in Alg. 2. Firstly, the algorithm calculates the personalized PageRank vector $w$ (line 1).
Secondly, the algorithm builds $|V|$ FM sketches for all nodes of the graph (lines 2-11). Here we make use of the idea of the approximate neighborhood function [23]. Specifically, the idea is based on the observation that the k-step expanded set of a node $v_i$ is equivalent to the union of the (k-1)-step expanded sets of the immediate neighbors of $v_i$. More formally, we have

$$N_k(\{v_i\}) = \bigcup_{(v_i, v_j) \in E} N_{k-1}(\{v_j\}). \qquad (6)$$

Based on this observation, we build an FM sketch for each node $v_i$ in a recursive manner (lines 7-11). Note that we use the bitwise-or over the FM sketches to implement the set union operation in Eq. (6) (line 11). Finally, Alg. 2 greedily selects $K$ nodes according to their approximate marginal gain (lines 12-30). In particular, we let $S$ be the answer set, NBP be the FM sketch representing the expanded set of the answer set $S$ ($N_k(S)$), $c$ be the k-step expansion of $S$ ($|N_k(S)|$), and OBP be a temporary FM sketch representing the expanded set of $S \cup \{v_i\}$, i.e., $N_k(S \cup \{v_i\})$. Initially, Alg. 2 sets $S$ to the empty set (line 12), NBP and OBP to 0 (line 13), and $c = 0$ (line 14). Then, Alg. 2 iteratively selects $K$ nodes with the maximal approximate marginal gain (lines 15-29). At each iteration, the algorithm chooses one node from $V \setminus S$ (lines 18-25). More specifically, for each node $v_i \in (V \setminus S)$, Alg. 2 first estimates $|N_k(S \cup \{v_i\})|$ using the FM sketch OBP (lines 19-21). Then, Alg. 2 calculates the approximate marginal gain of node $v_i$, $\rho_i(S) = (1 - \lambda) w_i + \lambda\, (|N_k(S \cup \{v_i\})| - |N_k(S)|) / n$, and records the node with the maximal approximate marginal gain (lines 22-25). Finally, Alg. 2 adds the node with the maximal approximate marginal gain to the answer set (lines 26-27) and re-estimates $|N_k(S)|$ by the FM sketch NBP (lines 28-29).

Theoretically, Alg. 2 achieves a $(1 - 1/e - \epsilon)$ approximation guarantee with high probability for the generalized diversified ranking problem, because the FM sketch approximates the k-step expansion of set $S$ within an $\epsilon$ error bound with high probability [22]. In our experiments, we will show that the performance of Alg. 2 is desirable.
In the following, we analyze the time and space complexity of Alg. 2.

Complexity analysis of the randomized greedy algorithm: The time complexity of Alg. 2 is O(k|E| + K|V|). Specifically, in line 1, Alg. 2 computes the personalized PageRank vector, which takes O(|E|) time. In lines 2-11, Alg. 2 takes O(k(|E| + |V|)) time to sketch the k-step expanded set for all nodes. In lines 12-29, the algorithm takes O(K|V|) time to find the answer set. Note that the bitwise-or can be done in near constant time [23]. Thus, the time complexity of Alg. 2 is O(k|E| + K|V|). For the space complexity, like Alg. 1, Alg. 2 needs to store the graph G and the personalized PageRank vector w, which consumes O(|V| + |E|) space. In addition, Alg. 2 needs to maintain O(|V|) FM sketches, which take O(|V| log |V|) bits. As a result, the space complexity of Alg. 2 is O(|V| log |V| + |E|). Notice that the space complexity of Alg. 2 is approximately O(|E|), as O(|V| log |V|) is dominated by O(|E|) in most graphs. Putting it all together, we conclude that the time and space complexity of Alg. 2 is linear w.r.t. the graph size, and thus it scales to large graphs.

4.3 Minimum relevance diversified measures

Besides MMR, there also exist other diversification criteria [26][27]. Here, we discuss some potential variants of the proposed diversified measures based on the minimum relevance criterion [26], where the worst-case relevance is maximized. The minimum relevance diversified measures are given as follows:

$$J(S) = (1 - \lambda) \min_{u \in S} w_u + \lambda \frac{|N(S)|}{|V|}, \quad (7)$$

and

$$J_k(S) = (1 - \lambda) \min_{u \in S} w_u + \lambda \frac{|N_k(S)|}{|V|}. \quad (8)$$

Unlike F(S) and F_k(S), the minimum relevance diversified measures defined above are not submodular. Thus, we cannot easily design an efficient greedy algorithm with an approximation guarantee. In effect, it is easy to show that the first term of the set function J(S) or J_k(S) is supermodular [28] (a set function J(S) is called supermodular if −J(S) is submodular) and the second term is submodular. Thus, the set function J(S) or J_k(S) is a sum of a submodular and a supermodular function, which could be approximately solved by a supermodular-submodular procedure [28]. Unfortunately, neither the convergence properties nor the approximation factor of the supermodular-submodular procedure are currently known. Developing an efficient algorithm with a performance guarantee to maximize J(S) and J_k(S) is interesting future work.

5 EXPERIMENTS

In this section, we evaluate the effectiveness and efficiency of the proposed approaches. Below, we first describe the experimental setup, and then report our experimental results.

5.1 Experimental setup

Datasets: We conduct our experiments on five real networks: three collaboration networks, one citation network, and one social network.

Collaboration networks. We select three collaboration networks from the Stanford network datasets [29], namely GrQc, HepTh, and CondMat. GrQc, HepTh, and CondMat are collaboration networks collected from the e-print arXiv archive and cover all the co-authorships between authors on General Relativity and Quantum Cosmology, High Energy Physics-Theory, and Condensed Matter Physics, respectively. Notice that all the collaboration networks are undirected graphs.

Citation network. We choose a citation network, namely citehepth, from the Stanford network datasets [29]. The citehepth is a citation network of papers on high energy physics theory, which was originally collected from the e-print arXiv archive. The citation network is a directed graph.

The social network. Flickr is a popular photo sharing website. The users in Flickr can upload photos, make friends, and join various interest groups. In our experiments, we employ the Flickr dataset from the ASU social computing data repository [30]. The dataset contains an undirected social network with 80,513 nodes and 5,899,882 edges, and 195 different groups that the users joined.

The detailed statistics of our datasets are presented in Table 1. From Table 1, we can observe that the approximate domination numbers (DN) of our datasets, which are calculated by a greedy algorithm given in [2], are greater than 1,000. However, in many practical retrieval systems, users are often interested in the top-K results, where K is a small constant (e.g., K = 3) that is typically smaller than the approximate DN.

TABLE 1
Summary of the datasets

name        nodes     edges       approximate DN
GrQc        5,242     28,980      1,598
HepTh       9,877     51,971      2,829
CondMat     23,133    186,936     4,449
citehepth   27,770    352,807     3,57
Flickr      80,513    5,899,882   3,768

Evaluation metrics: In the literature, there are no well accepted measures for diversity in ranking on graphs [3]. In our experiments, we employ two metrics to measure the diversity. One is proposed in [6], which makes use of the density of the subgraph induced by the top-K ranking nodes.
The density of a graph is the number of edges existing in the graph divided by the maximum possible number of edges in the graph. Intuitively, the density inversely measures the diversity of the top-K ranking nodes. The second metric is the expansion ratio, which is defined in Def. 2.1. The rationale is that a larger expansion ratio of the top-K ranking nodes indicates better diversity. For comparing the relevance of different algorithms, we use the relevance metric given in [8]. Specifically, the relevance Rel is calculated as

$$Rel = \frac{\sum_{v_i \in S} w_i}{\sum_{v_i \in \hat{S}} w_i}, \quad (9)$$

where S denotes the top-K diversified ranking list produced by the diversified ranking algorithm, and \hat{S} denotes the top-K ranking list produced by the personalized PageRank algorithm. Note that Rel defined in Eq. (9) falls into the interval [0, 1], as the personalized PageRank algorithm always gives the most relevant nodes. By definition, a higher Rel implies better relevance.
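The three metrics above can be illustrated with a short Python sketch. It assumes an undirected graph stored as a list of neighbor sets, and it assumes the expansion ratio of Def. 2.1 (not reproduced in this section) is the size of S together with its immediate neighbors divided by the number of nodes. The function names are hypothetical.

```python
def density(adj, S):
    """Edges inside the subgraph induced by S over the maximum possible."""
    S = set(S)
    # u < v counts each undirected edge exactly once
    edges = sum(1 for u in S for v in adj[u] if v in S and u < v)
    max_edges = len(S) * (len(S) - 1) / 2
    return edges / max_edges if max_edges else 0.0


def expansion_ratio(adj, S):
    """Assumed form: |S plus its immediate neighbors| / |V|."""
    covered = set(S)
    for u in S:
        covered.update(adj[u])
    return len(covered) / len(adj)


def relevance(w, S, S_ppr):
    """Eq. (9): score mass of S relative to the PageRank top-K list S_ppr."""
    return sum(w[i] for i in S) / sum(w[i] for i in S_ppr)


# Path graph 0-1-2-3-4 with hypothetical PageRank scores.
adj = [{1}, {0, 2}, {1, 3}, {2, 4}, {3}]
w = [0.3, 0.3, 0.2, 0.1, 0.1]
print(density(adj, [0, 1]), expansion_ratio(adj, [0, 1]))  # adjacent pair
print(density(adj, [0, 3]), expansion_ratio(adj, [0, 3]))  # spread-out pair
print(relevance(w, [0, 3], [0, 1]))
```

On this toy graph the adjacent pair {0, 1} has density 1.0 and covers 3 of 5 nodes, while the spread-out pair {0, 3} has density 0.0 and covers all 5 nodes, at the cost of a relevance of 2/3.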

Baselines: We compare our proposed methods with six baselines under the diversity and relevance metrics defined above. For our methods, we mainly focus on the k-step expansion for k = 1 and k = 2, denoted by Expansion-1 (Ep) and Expansion-2, respectively. Ep and Expansion-2 are tested using Alg. 1 and Alg. 2, respectively. We will study the effect of the parameter k in the following section. For other k-step expansions (k > 2), the performance is not significantly better than the 1-step and 2-step expansions. The six baselines are as follows.

Personalized PageRank (PPR): PPR is a natural competitor of our algorithm, which can serve as a baseline for evaluating relevance.

Grasshopper (Gra): Gra is a diversified ranking algorithm that leverages an absorbing random walk to achieve diversity [5]. Gra has been successfully used in diversified document summarization and in ranking actors in social networks.

Manifold Ranking with Stop Points (MRSP): MRSP is proposed in [7] and is very similar to the Grasshopper algorithm. It can also be used on graphs.

DivRank (DivR): DivR makes use of the stationary distribution of a vertex-reinforced random walk to rank nodes [6]. It has been applied to diversify ranking in information networks. There are two implementations of DivR, namely pointwise DivR and cumulative DivR. As reported in [6], the two algorithms achieve similar ranking performance. Hence, we use the pointwise DivR in our experiments.

Dragon (Dra): Dra is a scalable diversified ranking algorithm [8]. Dra aims to optimize a predefined diversified ranking measure. Unlike our diversified ranking measure, the measure used in Dra lacks a topological explanation, and is therefore less intuitive.

Diversified ranking via Resistive Graph Centers (RGC): RGC [10] aims to learn a diversified teleport vector to achieve diversity in ranking. However, the time complexity of RGC is cubic, so it cannot scale to large graphs.
We do not compare with the MMR algorithm [13] because [6] has shown that DivR outperforms MMR over graph datasets.

Parameter settings: In our proposed algorithms (Alg. 1 and Alg. 2), there are two common parameters: the damping factor α for computing the personalized PageRank, and the parameter λ used to trade off relevance and diversity. We set α = 5 as it is widely used in web search. For the parameter λ, we set it to 0.5 because the results are not very sensitive to it in our experiments. We will show the effect of λ in the following section. Additionally, for Alg. 2, we use 5 hashing functions to implement the FM sketch. For all parameters of the baseline methods, we use the same settings as given in the respective original papers.

Experimental environment: All the experiments are conducted on a Windows Server 27 with 4xDual-Core Intel Xeon 2.66 GHz CPU and 4G memory. All algorithms are implemented in MATLAB (R2a).

5.2 Experimental results

In all of our experiments, we randomly generate queries, and the results are averaged over all the queries. We give the detailed results as follows.

Results on collaboration networks: In this experiment, we compare Ep and Expansion-2 with the six baselines over the three collaboration networks. Fig. 2(a), Fig. 2(b), and Fig. 2(c) depict our results on the GrQc, HepTh, and CondMat datasets, respectively. From Fig. 2(a), we can observe that DivR and Gra achieve near-optimal relevance, followed by Ep, Dra, Expansion-2, MRSP, and RGC. Note that the relevance of both Ep and Expansion-2 remains high over different K values, which indicates that our algorithms can obtain relevant results w.r.t. the queries. We can clearly see that the relevance of RGC is extremely low: it is less than 0.3 over different K values. This result implies that RGC may produce irrelevant and meaningless results. For the diversity, we find that Expansion-2 is the winner under the expansion ratio metric among all the algorithms. Besides, Ep also outperforms the other baselines under the expansion ratio metric.
The expansion ratios of DivR, Gra, and MRSP are slightly worse than that of PPR, which suggests that DivR, Gra, and MRSP do not perform well at enhancing diversity in collaboration networks under the expansion ratio metric. Under the density metric, RGC outperforms the competitors (recall that smaller density implies better diversity). Ep, Expansion-2, and MRSP achieve comparable density, and they are slightly worse than Dra. DivR and Gra also do not perform well under the density metric. Similar results can be observed on the HepTh and CondMat datasets.

Based on these observations, on the collaboration networks, we conclude that DivR, Gra, and MRSP do not perform well regarding diversity. The reason would be that these algorithms lack a clear explanation for diversity. RGC exhibits excellent performance for improving diversity, but it significantly sacrifices relevance. Our Ep and Expansion-2 as well as Dra achieve a good tradeoff between relevance and diversity. The reason is that our algorithms and Dra have a clear objective: to optimize the predefined diversified ranking measures. Moreover, our algorithms exhibit better relevance and a better expansion ratio than Dra.

Results on citation network: Unlike the collaboration networks, the citation network is a directed graph. Here, we test MRSP by ignoring the direction of the edges, as MRSP cannot be directly applied to directed graphs. Fig. 3 describes our results.

Fig. 2. Comparison of various diversified ranking algorithms on collaboration networks (color online). Panels: (a) results on the GrQc dataset, (b) results on the HepTh dataset, (c) results on the CondMat dataset; each panel shows (i) relevance vs. K, (ii) expansion ratio vs. K, and (iii) density vs. K for PPR, Ep, Expansion-2, Dra, DivR, RGC, Gra, and MRSP.

Fig. 3. Comparison of various diversified ranking algorithms on the citehepth dataset. Panels: (a) relevance vs. K, (b) expansion ratio vs. K, (c) density vs. K.

From Fig. 3, we find that Gra outperforms the other algorithms under the relevance metric. RGC shows the lowest relevance, which suggests that RGC may generate completely irrelevant ranking results. The other baselines, except PPR, show comparable relevance. For our approaches, Ep shows better relevance than Expansion-2. For the diversity, Expansion-2 outperforms the other algorithms under the expansion ratio metric. The expansion ratio of Ep is better than that of the six baseline algorithms. However, under the density metric, we can observe that RGC achieves the best performance. Our approaches, MRSP, and Dra achieve comparable density. Also, for our approaches, Expansion-2 is slightly better than Ep under the density metric. In general, the results on the citation network are consistent with the results on the collaboration networks.

Results on Flickr social network: Here we test our proposed algorithms on the Flickr social network. Our goal is to find the top-K users who not only have higher personalized PageRank scores relative to the queries, but also cover as many interest groups as possible. Hence, in addition to the diversity measures described in Section 5.1, we introduce the group coverage as a new diversity measure in this experiment. Intuitively, the more groups that are covered by the top-K ranking list, the better its diversity. In this experiment, we only compare our Ep and Expansion-2 with PPR and Dra. The reason is twofold. First, the other baselines either cannot get answers in 2 hours or cannot be run due to their memory requirements. Second, as observed in our previous experiments, Dra outperforms the other baselines. Our results are shown in Fig. 4.

Fig. 4. Comparison of various diversified ranking algorithms on the Flickr social network. Panels: (a) relevance vs. K, (b) expansion ratio vs. K, (c) density vs. K, (d) group coverage vs. K.

From Fig. 4, we can observe that both Ep and Expansion-2 significantly outperform Dra based on the relevance, the expansion ratio, and the group coverage metrics. More specifically, under the relevance and expansion ratio metrics, Ep is clearly the best performer among all the diversified ranking algorithms. Also, notice that the relevance of Dra decreases as K increases and becomes low for large K. In contrast, our algorithms show quite robust relevance w.r.t. different K values, and their relevance remains high over the various K values. Under the density metric, Dra slightly outperforms Ep and Expansion-2. However, under the group coverage metric, Expansion-2 achieves the best performance, followed by Ep, Dra, and PPR. From the practical point of view, the performance of our algorithms is better than that of Dra, because the ranking results of our algorithms cover more interest groups than those of Dra. The reason may be that our diversified ranking measures capture the topological properties of the graph, which is more intuitive and reasonable than the measure used in Dra.

To summarize, over all of our experiments, we make the following observations.
(1) DivR and Gra achieve near-optimal relevance, but their performance at improving diversity is quite low. (2) RGC gets near-optimal diversity under the density metric, but it exhibits extremely low relevance. (3) The performance of MRSP is very low under the expansion ratio metric (even worse than PPR). (4) Ep, Expansion-2, and Dra show a good balance between relevance and diversity. Moreover, our Ep and Expansion-2 exhibit better relevance and diversity than Dra over most of the datasets used.

Precision comparison: To further evaluate the effectiveness of our algorithms, we compare the precision of our approaches with the state-of-the-art Dra. Since there is no ground truth in graph-type datasets, we use the personalized PageRank as the ground-truth ranking, which is also used in [8]. The precision is defined by the following formula:

$$Pre = \frac{|S \cap \hat{S}|}{|S|}, \quad (10)$$

where S and \hat{S} are defined as in Eq. (9). Fig. 5 depicts our results on the CondMat, citehepth, and Flickr datasets. Similar results can be observed on the other datasets.

Fig. 5. Comparison of precision of Ep, Expansion-2, and Dra. Panels: (a) CondMat, (b) citehepth, (c) Flickr.

From Fig. 5, we can clearly see that both Ep and Expansion-2 consistently outperform Dra on the CondMat and Flickr datasets over different K. On the citehepth dataset, we can observe that all three algorithms generate comparable rankings, and the performance of Ep is slightly better than that of Dra. The performance of Dra is not very stable over our datasets. On the citehepth dataset, the performance of Dra is comparable to our algorithms, but on the Flickr dataset, Dra does not perform well. This result implies that Dra produces less meaningful rankings on the Flickr dataset. In contrast to Dra, the performance of our algorithms is very stable over different datasets. In this sense, we can conclude that our algorithms are better than Dra.

Time comparison: We compare the average query processing time of various diversified ranking algorithms over the five network datasets. We take the average of the query processing time of the ranking algorithms over different K values and different queries. Table 2 shows our results.
From Table 2, we can observe that PPR is the most efficient algorithm. Ep and Dra achieve competitive efficiency with PPR. Expansion-2 is slightly worse than Ep, Dra, and PPR, but is still very efficient due to its linear time and space complexity. For the other baselines, we can clearly see that their time requirements are very high. Even worse, on the Flickr dataset, RGC, Gra, and MRSP cannot get the top-K ranking results in 2 hours, and DivR cannot be run due to its memory requirement. These results confirm our time and space complexity analysis in the previous sections.

TABLE 2
Average query time of various algorithms (in seconds). Columns: GrQc, HepTh, CondMat, citehepth, Flickr. Rows: PPR, Ep, Expansion-2, Dra, DivR, RGC, Gra, MRSP.

Effect of parameter λ: We study the effect of the parameter λ in Ep and Expansion-2, i.e., λ in Eq. (2) and Eq. (5), which is used to trade off relevance and diversity. Here we study the top 3 ranking results (K = 3) under different λ values on the Flickr dataset. Similar results can be observed on other datasets and for other K. We use the results of PPR and Dra as the baselines. The reasons are (1) the ranking result of PPR is a natural measure for relevance, and (2) Dra outperforms the other baselines. The results are depicted in Fig. 6.

Fig. 6. The effect of parameter λ. Panels: (a) relevance vs. λ, (b) expansion ratio vs. λ, (c) density vs. λ, (d) group coverage vs. λ.

As can be seen in Fig. 6(a), the relevance of Expansion-2 decreases as λ increases, while the relevance of Ep is robust w.r.t. λ. For the relevance, both Ep and Expansion-2 outperform Dra. According to Fig. 6(b), Fig. 6(c), and Fig. 6(d), we can observe that the diversity of Ep, which is measured by the expansion ratio, density, and group coverage, generally increases as λ increases. This is because a larger λ means that more weight is assigned to improving the diversity in our diversified measure (Eq. (2)). We also find that Ep is very robust w.r.t. λ. In addition, we can clearly see that both Ep and Expansion-2 outperform Dra under the expansion ratio and group coverage measures, while under the density measure, our algorithms are slightly worse than Dra.

Scalability testing and memory consumption: To study the scalability of Ep and Expansion-2, we generate two sets of synthetic graphs G, with nodes ranging from 100,000 to 900,000 and edges ranging from 800,000 to 4,000,000, respectively, using the Erdős-Rényi random graph model. Here we set K = 3, and similar results can be observed for other K. Our results are described in Fig. 7.

Fig. 7. Scalability of the proposed algorithms. Panels: average query time (s) vs. number of nodes (×10^5), and average query time (s) vs. number of edges (×10^5).

From Fig. 7, we can clearly see that both Ep and Expansion-2 scale linearly w.r.t. both the number of nodes (left part of Fig. 7) and the number of edges (right part of Fig. 7). Therefore, our Ep and Expansion-2 can be used for very large graphs. The results confirm our time complexity analysis in the previous sections. To validate the space complexity of our algorithms, in Fig. 8, we show the memory consumption of our algorithms on the same sets of synthetic graphs.

Fig. 8. Memory consumption of the proposed algorithms. Panels: memory consumption (G) vs. number of nodes (×10^5), and memory consumption (G) vs. number of edges (×10^5).

Specifically, in the left part of Fig. 8, we can see that the memory consumption of both Ep and Expansion-2 increases as the number of nodes increases. The curves of both Ep and Expansion-2 become a line when the number of nodes is larger than 500,000. Similarly, from the right part of Fig. 8, we can observe that the memory consumption of both Ep and Expansion-2 increases as the number of edges increases, and the curves of Ep and Expansion-2 tend to be a line when the number of edges is larger than 2,400,000. These results confirm the linear space complexity of our algorithms.

Performance of Alg. 2: It is worth noting that Alg. 2 gives an approximate answer instead of the exact answer given by Alg. 1. We evaluate the approximation performance of Alg. 2. To this end, we first use Alg. 2 to test the 1-step expansion (set k = 1 in Alg. 2), and we refer to it as Approx. Ep. We compare the performance of Approx. Ep with Ep, which is implemented by Alg. 1. Fig. 9 shows our results on the Flickr dataset. Similar results can be observed on other datasets.

Fig. 9. Performance of the randomized greedy algorithm (Ep vs. Approx. Ep). Panels: (a) relevance, (b) expansion ratio, (c) density, (d) group coverage.

From Fig. 9(a), we can find that Approx. Ep shows better relevance than Ep. However, from Fig. 9(b), (c), and (d), Approx. Ep is slightly worse than Ep under the three diversity metrics. Overall, Approx. Ep achieves comparable performance with Ep. These results suggest that our randomized greedy algorithm (Alg. 2) can achieve a good performance guarantee, which is consistent with our analysis in Section 4.

Effect of parameter k: We investigate how the parameter k affects the performance of the k-step expansion based algorithms, which are implemented by Alg. 2. Fig. 10 shows our results on the Flickr dataset, and similar results can be observed on other datasets.

Fig. 10. The effect of parameter k in k-step expansion based algorithms. Panels: (a) relevance vs. k, (b) expansion ratio vs. k, (c) density vs. k, (d) group coverage vs. k.

From Fig. 10, we can see that the relevance and diversity are generally not sensitive to k when k ≥ 2. The 2-step expansion (k = 2) achieves the best expansion ratio and density, and thus in our previous experiments we set k = 2.

6 RELATED WORK

Diversified ranking on text data: Diversity has been recognized as an important criterion in information retrieval. There is a large body of work on query or search results diversification [31][32][33][34][35][36][37]. In document retrieval, one well-known method is the maximal marginal relevance (MMR) proposed by Carbonell and Goldstein [13], which achieves diversity by maximizing a linear combination function that captures both dissimilarity among the results and relevance w.r.t. the query. After Carbonell and Goldstein's work, many approaches addressing diversification have been proposed in recent years. Zhai et al. [38] propose a subtopic retrieval approach to results diversification. Agrawal et al. [39] formulate the query results diversification as a submodular function maximization problem. Gollapudi et al. [26] present several axioms for query results diversification.
All the above mentioned methods primarily address document data. An excellent survey on query results diversification is given in [27].

Submodular set function maximization: Our diversified ranking problem is closely related to the submodular set function maximization problem, which is generally NP-hard. However, there always exists a near-optimal greedy algorithm for solving such a problem [17]. Many applications have been formulated as a submodular set function maximization problem, such as the influence maximization problem in social networks [40], the observation selection and sensor placement problems [41], [42], the document summarization problem [43], [37], as well as the set cover problem [44]. In this paper, we formulate the diversified ranking problem on graphs as a submodular set function maximization problem.

Expansion on graphs: Our work is also related to the expansion of a graph, which is a well known concept in expander graph theory [12]. This concept has recently been used for sampling community structure [45] and facilitating decentralized search in networks [14]. However, our definition of expansion is different from the previous work, and we leverage expansion to measure the diversity of the top-K ranking results.

7 CONCLUSIONS

In this paper, we present a study of finding the top-K diversified ranking on graphs. Firstly, we propose a novel diversified ranking measure, which captures both relevance and diversity. Secondly, we prove the submodularity of this measure and design an efficient greedy algorithm to achieve near-optimal diversified ranking. The proposed method has linear time and space complexity w.r.t. the size of the graph, and thus it scales to large graphs. Thirdly, we present a generalized diversified ranking measure and develop an efficient randomized greedy algorithm for maximizing it accurately. Finally, extensive experiments show the effectiveness, efficiency, and scalability of the proposed methods.


ACKNOWLEDGMENTS

The work was supported by a grant of the Research Grants Council of the Hong Kong SAR, China, No. CUHK/499.

REFERENCES

[1] S. Brin and L. Page, Pagerank: Bringing order to the web, Stanford Digital Library Project, Tech. Rep., 1997.
[2] M. E. J. Newman, Networks: An Introduction. Oxford University Press, 2010.
[3] T. H. Haveliwala, Topic-sensitive pagerank, in WWW, 2002.
[4] G. Jeh and J. Widom, Scaling personalized web search, in WWW, 2003.
[5] X. Zhu, A. B. Goldberg, J. V. Gael, and D. Andrzejewski, Improving diversity in ranking using absorbing random walks, in HLT-NAACL, 2007.
[6] Q. Mei, J. Guo, and D. R. Radev, Divrank: the interplay of prestige and diversity in information networks, in KDD, 2010.
[7] X. Zhu, J. Guo, X. Cheng, P. Du, and H. Shen, A unified framework for recommending diverse and relevant queries, in WWW, 2011.
[8] H. Tong, J. He, Z. Wen, R. Konuru, and C.-Y. Lin, Diversified ranking on large graphs: an optimization viewpoint, in KDD, 2011.
[9] R.-H. Li and J. X. Yu, Scalable diversified ranking on large graphs, in ICDM, 2011.
[10] A. Dubey, S. Chakrabarti, and C. Bhattacharyya, Diversity in ranking via resistive graph centers, in KDD, 2011.
[11] O. Haggstrom, Finite markov chains and algorithmic applications. Cambridge University Press, 2002.
[12] S. Hoory, N. Linial, and A. Wigderson, Expander graphs and their applications, Bull. Amer. Math. Soc., vol. 43, 2006.
[13] J. G. Carbonell and J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in SIGIR, 1998.
[14] A. S. Maiya and T. Y. Berger-Wolf, Expansion and search in networks, in CIKM, 2010.
[15] R. Kumar, K. Punera, and A. Tomkins, Hierarchical topic segmentation of websites, in KDD, 2006.
[16] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila, Finding effectors in social networks, in KDD, 2010.
[17] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, An analysis of approximations for maximizing submodular set functions-I, Mathematical Programming, vol. 14, 1978.
[18] U. Feige, A threshold of ln n for approximating set cover, J. ACM, vol. 45, 1998.
[19] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance, Cost-effective outbreak detection in networks, in KDD, 2007.
[20] T. W. Haynes, S. T. Hedetniemi, and P. J. Slater, Domination in graphs: advanced topics. Marcel Dekker, Inc., 1998.
[21] T. H. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, Third Edition. MIT Press, 2001.
[22] P. Flajolet and G. N. Martin, Probabilistic counting algorithms for data base applications, J. Comput. Syst. Sci., vol. 31, no. 2, 1985.
[23] C. R. Palmer, P. B. Gibbons, and C. Faloutsos, ANF: a fast and scalable tool for data mining in massive graphs, in KDD, 2002.
[24] M. Durand and P. Flajolet, Loglog counting of large cardinalities (extended abstract), in ESA, 2003.
[25] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier, Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm, in ESA, 2003.
[26] S. Gollapudi and A. Sharma, An axiomatic approach for result diversification, in WWW, 2009.
[27] M. Drosou and E. Pitoura, Search result diversification, SIGMOD Rec., vol. 39, pp. 41-47, 2010.
[28] N. Narasimhan and J. Bilmes, A supermodular-submodular procedure with applications to discriminative structure learning, in UAI, 2005.
[29] J. Leskovec, Stanford network analysis project, 2010. [Online].
[30] R. Zafarani and H. Liu, Social computing data repository at ASU, 2009. [Online].
[31] F. Radlinski, P. N. Bennett, B. Carterette, and T. Joachims, Redundancy, diversity and interdependent document relevance, SIGIR Forum, vol. 43, 2009.
[32] Y. Zhang, J. P. Callan, and T. P. Minka, Novelty and redundancy detection in adaptive filtering, in SIGIR, 2002.
[33] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen, Improving recommendation lists through topic diversification, in WWW, 2005.
[34] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon, Novelty and diversity in information retrieval evaluation, in SIGIR, 2008.
[35] H. Ma, M. R. Lyu, and I. King, Diversifying query suggestion results, in AAAI, 2010.
[36] E. Minack, W. Siberski, and W. Nejdl, Incremental diversification for very large sets: a streaming-based approach, in SIGIR, 2011.
[37] H. Lin and J. Bilmes, A class of submodular functions for document summarization, in ACL, 2011.
[38] C. Zhai, W. W. Cohen, and J. D. Lafferty, Beyond independent relevance: methods and evaluation metrics for subtopic retrieval, in SIGIR, 2003.
[39] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, Diversifying search results, in WSDM, 2009.
[40] D. Kempe, J. M. Kleinberg, and É. Tardos, Maximizing the spread of influence through a social network, in KDD, 2003.
[41] A. Krause and C. Guestrin, Near-optimal observation selection using submodular functions, in AAAI, 2007.
[42] A. Krause, A. P. Singh, and C. Guestrin, Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies, Journal of Machine Learning Research, vol. 9, 2008.
[43] H. Lin and J. Bilmes, Multi-document summarization via budgeted maximization of submodular functions, in HLT-NAACL, 2010.
[44] V. V. Vazirani, Approximation Algorithms. Springer, 2004.
[45] A. S. Maiya and T. Y. Berger-Wolf, Sampling community structure, in WWW, 2010.

Rong-Hua Li is pursuing his PhD degree in the Department of System Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong. His research interests include social network analysis and mining, complex network theory, uncertain graphs mining, Monte-Carlo algorithms, and machine learning.

Jeffrey Xu Yu received the BE, ME, and PhD degrees in computer science from the University of Tsukuba, Japan, in 1985, 1987, and 1990, respectively. He held teaching positions in the Institute of Information Sciences and Electronics, University of Tsukuba, Japan, and the Department of Computer Science, The Australian National University. Currently, he is a professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He is serving as a VLDB Journal editorial board member. His current main research interests include graph databases, graph mining, keyword search in relational databases, and social network analysis.
