Department of Computer Science and Engineering University of Texas at Arlington Arlington, TX 76019


Extensions to Pairwise Similarity Calculation of Information Networks
Yuanzhe Cai and Sharma Chakravarthy
Technical Report CSE

Extensions to Pairwise Similarity Calculation of Information Networks

Yuanzhe Cai and Sharma Chakravarthy
CSE Department and Information Technology Laboratory
The University of Texas at Arlington, Arlington, TX 76019, USA

Abstract. We focus on extensions to the pairwise similarity calculation of information networks. By considering both in- and out-link relationships, we propose Additive- and Multiplicative-SimRank to calculate the similarity score. We then discuss the loop/cycle problem of information networks and propose a method to address it. Our extensive experiments on eight food web data sets show that our approach performs significantly better than earlier approaches.

1 INTRODUCTION

Computing pairwise similarity in an information network is a fundamental problem in the study of the patterns and processes of information systems. A food web, a kind of information network, represents the predator-prey relationships between species within an ecosystem. Consider the following example from a food web.

Example 1 (Motivation). The dodos lived peacefully on the island of Mauritius for several hundred years. Because of poaching by humans and killing by animals (such as pigs, rats and cats) that sailors had introduced to the island, the dodo died off extremely quickly. The last dodo died around 1681. About three hundred years later, in 1973, the tambalacoque, also called the dodo tree, was dying out; only about 13 trees were left on the island. Scientists found that the seeds of the dodo tree had to pass through the digestive system of a dodo before they could germinate. To aid germination, scientists therefore used turkeys to erode the shell of the dodo tree seed. In this case, humans saved the dodo tree, but the turkey, like the cats, rats and pigs introduced to the island earlier, may also upset the balance of the ecosystem. Failed introductions, such as the Australian rabbit and the Xisha Islands cat, are also alarming.
This raises an interesting question: if one species becomes extinct in an ecosystem and we want to introduce a new species to keep the balance, what kind of species should we introduce? The answer is that we should introduce a species with similar food habits in this ecosystem. In this example, the turkey and the dodo are very similar because they eat similar foods, such as the seed of the dodo tree, and they have similar natural enemies. Here, the prey relationship and the is-preyed relationship are used to define the similarity score between two species.

Based on this observation, a number of approaches have been proposed to quantify the similarity between species in a food web. The most widely used approaches in earlier work [10] [11] [14] are the Jaccardian similarity functions. The intuition behind a Jaccardian similarity function is that two species are similar if they share many food web neighbors relative to the total number of their neighbors. The $S_{jaccard}(a, b)$ equation [10] is shown below:

$$S_{jaccard}(a, b) = \frac{|n(a) \cap n(b)|}{|n(a) \cup n(b)|} \qquad (1)$$

where $n(a)$ and $n(b)$ are the sets of neighbors of species a and b, $|n(a) \cap n(b)|$ is the number of prey and predator species that a and b have in common, and $|n(a) \cup n(b)|$ is the total number of prey and predators of a and b.

[Fig. 1: Segment of CYPWET data set [3]. Nodes: 1. fishing spider, 2. crayfish, 3. apple snail, 4. lizards, 5. salamander, 6. small frogs, 7. gruiformes, 8. ducks.]

[Table 1: Similarity Score. Rows: Jaccard, SimRank. Columns: S(1, 2), S(2, 3), S(4, 5), S(5, 6), S(7, 8).]

However, earlier research only considers the direct relationships in the information network. Consider the example in Figure 1, where we want to calculate the similarity between fishing spider and crayfish. According to the direct relationships, shown in Table 1, the similarity between these two species is zero, although they are related through their indirect predator, gruiformes. This example shows that when we consider the similarity between two species in a food web, we also need to consider the indirect relationships through the other species related to them. This problem is addressed by the SimRank algorithm [4], in which the similarity between two objects is recursively defined as the average similarity between their neighbors. However, this definition only considers one direction of the relationships in the information network. In Figure 1, SimRank only considers the is-preyed relationship (the predators of a species), but the similarity of the species' prey also contributes to their similarity.

The other major problem is that many real-life information networks contain cycles. A food web, for example, includes many cannibals that create self-loops and cycles. In Figure 1, the salamander is a cannibal that preys on other salamanders. These loops in the food web also influence the similarity scores.
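As a concrete reading of equation (1), the following minimal sketch computes the Jaccard score from sets of direct neighbors. The toy food web and the function name `jaccard` are assumptions made for illustration; they are not taken from the CYPWET data.

```python
def jaccard(neighbors, a, b):
    """Equation (1): shared neighbors divided by all neighbors of a and b."""
    na, nb = neighbors[a], neighbors[b]
    union = na | nb
    if not union:          # neither species has any neighbor
        return 0.0
    return len(na & nb) / len(union)


# Hypothetical toy food web: each species maps to the set of its direct
# predators and prey.
neighbors = {
    "dodo":   {"dodo tree seed", "pig", "rat"},
    "turkey": {"dodo tree seed", "pig", "cat"},
    "snail":  {"algae"},
}

print(jaccard(neighbors, "dodo", "turkey"))  # 2 shared out of 4 neighbors
print(jaccard(neighbors, "dodo", "snail"))   # no shared neighbor
```

Because only direct neighbors enter the score, two species with no common neighbor always score zero, which is the limitation discussed above.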
The main contributions of this paper are:

- Based on the relationships of topological structures in information networks, two similarity algorithms, Additive-SimRank and Multiplicative-SimRank, are proposed. We also prove by theoretical analysis that the proposed algorithms converge.
- We discuss the loops and cycles problem in information networks and propose a method to handle them.
- Extensive experiments are conducted to evaluate the accuracy of the proposed algorithms. Additive-SimRank is shown to have higher accuracy than the other methods.

Roadmap: The rest of this paper is organized as follows. We introduce related work in section 2 and define the graph model in section 3. SimRank is described in section 4. Two similarity measures for information networks and the cycle problem are discussed in section 5. Our experimental analysis is reported in section 6 and conclusions are in section 7.

2 RELATED WORK

We categorize existing work related to our study into three classes: species aggregation, link-based similarity calculation, and random walks on graphs.

Species aggregation: Setting a new criterion for searching community food web data, Martinez [10] [11] was the first researcher to systematically analyze the effects of variable species aggregation on the network structure of food webs. Different indices are used to quantify similarity between objects. The Jaccard index is probably the best known and most widely used in food web research [10] [11] [14]. Martinez used an Additive-Jaccard index to determine the similarity between species in Little Rock Lake and then used average-linkage clustering to aggregate taxonomies. However, these methods do not consider the potential (indirect) relationships between species.

Link-based similarity calculation: The earliest research on similarity calculation based on link analysis focuses on the citation patterns of scientific papers. The most common measures are co-citation [12] and co-coupling [6]. Co-citation indicates that if two documents are often cited together by other documents, they may have the same topic. Co-coupling for scientific papers means that if two papers cite many papers in common, they may focus on the same topic. However, all these methods compute similarity by considering only immediate neighbors. In contrast, SimRank [4] considers the entire relationship graph to determine the similarity between two nodes. Because of the high time complexity ($O(n^4)$) of this approach, many papers [2] [13] [1] [8] have focused on performance improvement, but few have focused on improving the accuracy of SimRank. In this paper, our focus is on extending the SimRank approach by considering bidirectional relationships and cycles to improve the accuracy of the original SimRank approach.

Random walks on graphs: The theoretical basis of our work uses hit times for two surfers walking randomly on the graph. We mainly refer to research on expected f-meeting distance theory [4]. Other research, such as random walk theory [9] and the Markov model [7], also helps in understanding our work.

3 GRAPH MODEL

Food web data can be represented as a directed graph, G(V, E), which consists of a set of nodes V representing species and a set of directed edges E representing the relationships between species. For example, Figure 1 is a relationship graph that describes the predatory relationships in the marshes and sloughs.
In this graph, a directed edge <p, q> from species p to species q corresponds to a predator relationship. I(v) denotes the set of predators preying on species v, which is also the set of in-link neighbors of v, and O(v) denotes the set of species preyed on by v, which is also the set of out-link neighbors of v.

4 OVERVIEW OF SimRank

SimRank [4] is a method for measuring link-based similarity between objects in a graph that models the object-to-object relationships in a particular domain. The intuition behind the SimRank score is that two objects are similar if they are related to similar objects. This intuition also indicates that the SimRank calculation needs to be recursive. Below, we present the formula to compute SimRank.

Given a graph G(V, E) consisting of a set of nodes V and a set of links E, the SimRank similarity between objects a and b, denoted as S(a, b), is computed recursively as follows:

$$S(a, b) = \begin{cases} 1 & \text{if } a = b \\ \dfrac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b)) & \text{if } a \neq b \end{cases} \qquad (2)$$

where c is a constant decay factor, $0 < c < 1$; $I(a)$ is the set of in-neighbor nodes of a, $I_i(a)$ is the i-th in-neighbor node of a, and $|I(a)|$ is the number of in-neighbors of a. If $I(a)$ or $I(b)$ is an empty set, S(a, b) is defined as zero.

A solution to the SimRank equation (2) can be reached by iteration to a fixed point. For each iteration k, let $S_k(\cdot, \cdot)$ be the iteration similarity function and $S_k(a, b)$ be the similarity score of the pair (a, b) at iteration k. The iteration process starts with $S_0$:

$$S_0(a, b) = \begin{cases} 0 & \text{if } a \neq b \\ 1 & \text{if } a = b \end{cases}$$
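The fixed-point iteration for equation (2) can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' Java implementation: the graph is a hypothetical four-node example, the dictionary `in_nbrs` is an assumed input format, and a fixed number of iterations replaces a convergence test.

```python
from itertools import product


def simrank(in_nbrs, c=0.8, iterations=10):
    """Naive fixed-point iteration of equation (2).

    in_nbrs maps every node to the list of its in-neighbors I(v).
    """
    nodes = list(in_nbrs)
    # S_0: 1 on the diagonal, 0 elsewhere.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                new[(a, b)] = 0.0      # empty I(a) or I(b)
            else:
                total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Hypothetical graph: predator p links to prey x and y; x also links to z.
in_nbrs = {"p": [], "x": ["p"], "y": ["p"], "z": ["x"]}
scores = simrank(in_nbrs, c=0.8)
print(scores[("x", "y")])   # x and y share their only predator
print(scores[("x", "z")])   # the predators of x and z (p and x) are dissimilar
```

Each iteration touches every node pair and every pair of their in-neighbors, which is the source of the high time complexity mentioned in section 2.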

To calculate $S_{k+1}(a, b)$ from $S_k(a, b)$, we use the following equation:

$$S_{k+1}(a, b) = \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S_k(I_i(a), I_j(b)) \qquad (3)$$

In equation (3), $1/|I(a)|$ is the single-step probability of walking from node a to a node in $I(a)$. Therefore, we can use the Backward Transfer Probability Matrix (BT), as in PageRank, to capture the single-step probability in a Markov chain. Thus, the SimRank algorithm can be described by matrix calculation. $S_0 = E$, where E is an identity matrix. Equation (3) can be rewritten as:

$$S_k(a, b) = c \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} BT_{a I_i(a)} \, BT_{b I_j(b)} \, S_{k-1}(I_i(a), I_j(b)) \qquad (4)$$

Although the convergence of the iterative SimRank algorithm is guaranteed in theory, practical computation uses a tolerance factor ε to control the number of iterations so that only a finite number of iterations are performed. It is recommended to set ε = 0.001, the same as in PageRank. Specifically, the terminating condition of the iteration is as follows:

$$\max_{(a, b)} \left( \frac{|S_k(a, b) - S_{k-1}(a, b)|}{|S_{k-1}(a, b)|} \right) \leq \varepsilon \qquad (5)$$

That is, the iteration stops when the maximal relative change of the similarity value between two iterations, over all node pairs, is no larger than the threshold ε.

5 EXTENDING THE SIMILARITY MEASURE

In this section, we first describe our analysis of the information network. Then, we describe our topological similarity definition on the network. Finally, we discuss the loops problem on the network.

5.1 Topological Similarity

If we want to compare the similarity between the dodo and the turkey on Mauritius, we need to answer the following questions:

1. Do the dodo and the turkey eat similar food? If the turkey does not eat the dodo tree's seed, we do not need to introduce the turkey into this ecosystem, because it does not play a role similar to that of the dodo.

2. Do the dodo and the turkey have similar natural enemies?
If the turkey does not have natural enemies similar to those of the dodo, or has no natural enemies at all, the dodo's natural enemies may not find enough food and also become extinct, or turkeys may proliferate and break the biological balance.

Thus, we can identify two intuitions for defining similarity in the food web.

Intuition 1: Two species are similar if they are preyed on by similar species.

Intuition 2: Two species are similar if they prey on similar species.

Let us look at Table 1 again. Surprisingly, SimRank does not produce a similarity score for the pair gruiformes and ducks, although these two species have the same classification (avifauna) and prey on the same species, salamander. The problem is that SimRank only considers the is-preyed relationship in the food web; the other important relationship, prey, is not considered in the similarity calculation.
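The matrix form in equation (4) and the stopping rule in equation (5) can be sketched as below, together with the observation that plain SimRank gives two top predators a score of zero. This is a minimal pure-Python sketch under stated assumptions: the four-node web is hypothetical, and because equation (5) divides by the previous score, the sketch assumes that a score moving away from 0 counts as an unbounded change while a score staying at 0 counts as no change.

```python
def backward_matrix(adj):
    """BT[a][x] = 1/|I(a)| if x is an in-neighbor (predator) of a, else 0.

    adj[p][q] == 1 means an edge from predator p to prey q.
    """
    n = len(adj)
    bt = [[0.0] * n for _ in range(n)]
    for a in range(n):
        preds = [p for p in range(n) if adj[p][a]]
        for p in preds:
            bt[a][p] = 1.0 / len(preds)
    return bt


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def simrank_matrix(adj, c=0.8, eps=0.001, max_iter=100):
    n = len(adj)
    bt = backward_matrix(adj)
    bt_t = [list(col) for col in zip(*bt)]
    prev = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        # Equation (4) for all pairs at once: S_k = c * BT * S_{k-1} * BT^T,
        # with the diagonal reset to 1.
        raw = matmul(matmul(bt, prev), bt_t)
        cur = [[1.0 if i == j else c * raw[i][j] for j in range(n)]
               for i in range(n)]
        # Equation (5), with the zero-denominator assumption described above.
        change = 0.0
        for i in range(n):
            for j in range(n):
                diff = abs(cur[i][j] - prev[i][j])
                if diff == 0.0:
                    continue
                rel = diff / prev[i][j] if prev[i][j] else float("inf")
                change = max(change, rel)
        prev = cur
        if change <= eps:
            break
    return prev


# Hypothetical 4-node web: 0 gruiformes, 1 ducks, 2 salamander, 3 small frogs.
adj = [
    [0, 0, 1, 1],   # gruiformes prey on salamander and small frogs
    [0, 0, 1, 0],   # ducks prey on salamander
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
scores = simrank_matrix(adj)
print(scores[0][1])   # the two predators have no predators of their own
print(scores[2][3])   # the two prey share gruiformes as a predator
```

In this toy web the two birds share prey but have empty in-neighbor sets, so the in-link-only recursion cannot relate them.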

Considering both relationships in the food web, the similarity score should combine the similarity from both. We can add the is-preyed similarity score and the prey similarity score together and use a parameter γ to adjust the contribution of the two relationships to the total score. We call this the additive method and propose the following formula for calculating the similarity score:

$$S(a, b) = \begin{cases} 1 & \text{if } a = b \\ \gamma \dfrac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b)) + (1 - \gamma) \dfrac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} S(O_i(a), O_j(b)) & \text{if } a \neq b \end{cases} \qquad (6)$$

where c is a constant decay factor, $0 < c < 1$; $I(a)$ is the set of predators of species a, $I_i(a)$ is the i-th predator of a, and $|I(a)|$ is the number of predators of a; $O(a)$ is the set of prey of species a, $O_i(a)$ is the i-th prey of a, and $|O(a)|$ is the number of prey of a. γ is a constant parameter used to adjust the relative effect of the is-preyed and prey relationships, $0 \leq \gamma \leq 1$.

Another way to extend SimRank is to multiply the is-preyed and prey relationship similarities. This product can also describe the relationship similarity score. We call this the multiplicative method, with the following formula for calculating the similarity score:

$$S(a, b) = \begin{cases} 1 & \text{if } a = b \\ \dfrac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b)) \cdot \dfrac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} S(O_i(a), O_j(b)) & \text{if } a \neq b \end{cases} \qquad (7)$$

where the parameter definitions are the same as in the additive case.
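To make the difference between equations (6) and (7) concrete, the sketch below iterates both updates on a small hypothetical food web. It is illustrative only and rests on two assumptions: a term whose neighbor set is empty contributes zero, by analogy with the empty-set rule of equation (2), and each factor in equation (7) carries its own decay factor c. The helper names (`avg_sim`, `iterate`) are not from the paper.

```python
def avg_sim(sim, xs, ys):
    """Average of sim over all pairs from xs and ys; 0 if either is empty."""
    if not xs or not ys:
        return 0.0
    return sum(sim[(x, y)] for x in xs for y in ys) / (len(xs) * len(ys))


def iterate(preds, preys, combine, iterations=10):
    nodes = list(preds)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        # The comprehension reads the previous sim and builds a new one.
        sim = {
            (a, b): 1.0 if a == b else combine(
                avg_sim(sim, preds[a], preds[b]),   # is-preyed (in-link) part
                avg_sim(sim, preys[a], preys[b]),   # prey (out-link) part
            )
            for a in nodes for b in nodes
        }
    return sim


c, gamma = 0.8, 0.75


def additive(s_in, s_out):          # equation (6)
    return gamma * c * s_in + (1 - gamma) * c * s_out


def multiplicative(s_in, s_out):    # equation (7)
    return (c * s_in) * (c * s_out)


# Hypothetical web: both birds eat salamander, gruiformes also eat frogs.
preds = {"gruiformes": [], "ducks": [],
         "salamander": ["gruiformes", "ducks"], "frogs": ["gruiformes"]}
preys = {"gruiformes": ["salamander", "frogs"], "ducks": ["salamander"],
         "salamander": [], "frogs": []}

add_scores = iterate(preds, preys, additive)
mul_scores = iterate(preds, preys, multiplicative)
print(add_scores[("gruiformes", "ducks")])   # positive: shared prey counts
print(mul_scores[("gruiformes", "ducks")])   # zero: no predators in common
```

With γ = 1 the additive update reduces to the in-link term alone, so the first score would also drop to zero.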
Algorithm 1 Additive-SimRank
Require: decay factor c; tolerance factor ε; Forward Transfer Probability Matrix FT (the forward probability of moving from state i to state j in one step); Backward Transfer Probability Matrix BT (the backward probability of moving from state i to state j in one step);
Ensure: similarity matrix $S_k$;
 1: k ← 1;
 2: $S_0$ ← identity;
 3: while ($\max(|S_k(a, b) - S_{k-1}(a, b)| / |S_{k-1}(a, b)|) > \varepsilon$)
 4:   k ← k + 1;
 5:   $S_{k-1}$ ← $S_k$;
 6:   for each element $S_k(a, b)$
 7:     $S_k(a, b)$ ← $\gamma \, c \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} BT_{a I_i(a)} BT_{b I_j(b)} S_{k-1}(I_i(a), I_j(b)) + (1 - \gamma) \, c \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} FT_{a O_i(a)} FT_{b O_j(b)} S_{k-1}(O_i(a), O_j(b))$;
 8:   end for;
 9: end while;
10: return $S_k$;

Algorithm 1 outlines the Additive-SimRank computation. It takes four arguments. The first two are inherited from the original SimRank algorithm: the decay factor c gives the rate of decay as similarity flows across edges in the graph, and the tolerance factor ε controls the number of iterations, as discussed in section 4. The next parameter is the Forward Transfer Probability Matrix FT. As we can see from equation 6, $1/|O(a)|$ is the single-step probability of walking from node a to a node in $O(a)$. Thus, we use the Forward Transfer Probability Matrix FT [7] to calculate the similarity score in our algorithm. On the food web, the FT matrix is the transfer matrix of the prey relationship. The last parameter is the Backward Transfer Probability Matrix BT. As we can also see from equation 6, $1/|I(a)|$ is the single-step probability of walking from node a to a node in $I(a)$. Thus, we use the Backward Transfer Probability Matrix BT [7] to calculate the similarity score in our algorithm. On the food web, the BT matrix is the transfer matrix of the is-preyed relationship.

The Additive-SimRank algorithm first initializes its variables (lines 1-2). In line 3, the algorithm stops once the ending condition, equation 5, is satisfied. The algorithm then uses equation 6 to calculate the similarity score (line 7). Although the worst-case time and space complexity of Additive-SimRank is the same as that of SimRank, the accuracy of Additive-SimRank is higher than that of the original SimRank because it considers both relationships of the graph. The Multiplicative-SimRank algorithm is the same as Algorithm 1 except for step 7, where equation 7 is used. The theoretical foundations of Additive-SimRank and Multiplicative-SimRank are discussed below.

Forward and Backward Random Walk Model: Since BT and FT in Algorithm 1 (and its counterpart for Multiplicative-SimRank) can be considered as the single-step backward and forward transfer matrices of a Markov chain, the iterative similarity calculation of equations 6 and 7 can be explained using two random surfers walking forward and backward. The two surfers start from two nodes of the graph and walk from node to node step by step. In each step, they walk one step backward or forward, respectively, and we calculate the probability that the two surfers meet. The final result of both methods can be interpreted as the probability of two random surfers meeting each other when both forward and backward random walks are considered. Equations 6 and 7 use different methods to combine these meeting probabilities at each step.
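The steps of Algorithm 1 can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the adjacency-matrix input, the function names and the `history` return value are hypothetical, and pairs whose previous score is 0 are compared by absolute change so that the test of equation (5) never divides by zero.

```python
def transfer_matrices(adj):
    """Row-normalised one-step matrices for a 0/1 adjacency matrix.

    adj[p][q] == 1 is an edge from predator p to prey q.
    FT[a][x] = 1/|O(a)| for x in O(a); BT[a][x] = 1/|I(a)| for x in I(a).
    """
    n = len(adj)
    ft = [[0.0] * n for _ in range(n)]
    bt = [[0.0] * n for _ in range(n)]
    for a in range(n):
        out_deg = sum(adj[a])
        in_deg = sum(adj[p][a] for p in range(n))
        for x in range(n):
            if adj[a][x]:
                ft[a][x] = 1.0 / out_deg
            if adj[x][a]:
                bt[a][x] = 1.0 / in_deg
    return ft, bt


def additive_simrank(adj, c=0.8, gamma=0.75, eps=0.001, max_iter=100):
    n = len(adj)
    ft, bt = transfer_matrices(adj)
    prev = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    history = [prev]                       # scores after each iteration
    for _ in range(max_iter):
        cur = [[1.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                # Line 7 of Algorithm 1: backward (BT) and forward (FT) terms.
                back = sum(bt[a][x] * bt[b][y] * prev[x][y]
                           for x in range(n) for y in range(n))
                fwd = sum(ft[a][x] * ft[b][y] * prev[x][y]
                          for x in range(n) for y in range(n))
                cur[a][b] = gamma * c * back + (1 - gamma) * c * fwd
        history.append(cur)
        # Equation (5); assumption: a previous score of 0 is replaced by 1
        # in the denominator, i.e. the change is measured in absolute terms.
        change = max(abs(cur[a][b] - prev[a][b]) / (prev[a][b] or 1.0)
                     for a in range(n) for b in range(n))
        prev = cur
        if change <= eps:
            break
    return prev, history


# Hypothetical 4-node web: 0 gruiformes, 1 ducks, 2 salamander, 3 small frogs.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
scores, history = additive_simrank(adj)
print(scores[0][1])                                   # gruiformes vs ducks
print(additive_simrank(adj, gamma=1.0)[0][0][1])      # gamma = 1: SimRank
```

Keeping `history` makes it easy to inspect how the scores of a pair evolve from one iteration to the next.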
In equation 6, we add the meeting probabilities of forward and backward walking and use γ to adjust the proportion of the backward and forward meeting probabilities. In equation 7, we directly multiply the forward and backward meeting scores. Since SimRank only considers the backward random walk, it is a special case of our method: if γ is set to 1 in equation 6, the equation is the same as the SimRank function. The multiplicative form distinguishes the roles of predator and prey for each species and requires a high similarity in both roles to achieve a high overall score. The additive form, in contrast, uses the parameter γ to adjust the weights associated with the predator and prey roles of each species. We give the convergence proof of Additive-SimRank and Multiplicative-SimRank below.

Lemma 1. Let $AR_k(a, b)$ be the Additive-SimRank score of the pair (a, b) at the k-th iteration. Then

$$AR_{k+1}(a, b) - AR_k(a, b) \geq 0 \qquad (8)$$

Proof. If a = b, then $AR_{k+1}(a, b) = 1$ and $AR_k(a, b) = 1$ by definition, so $AR_{k+1}(a, b) - AR_k(a, b) = 0$ and (8) holds. In the same way, if $I(a), I(b) = \emptyset$ and $O(a), O(b) = \emptyset$, then by definition $AR_{k+1}(a, b) = AR_k(a, b) = 0$, so the difference is 0 and (8) holds.

Induction base step: We prove that (8) holds for k = 0, i.e., that for every two nodes a, b: $AR_1(a, b) - AR_0(a, b) \geq 0$. If $a \neq b$, then $AR_0(a, b) = 0$, and $AR_1(a, b)$ is defined by the iterative form of equation (6) as a non-negative combination of $AR_0$ values. Hence

$$AR_1(a, b) - AR_0(a, b) = AR_1(a, b) \geq 0$$

Inductive step: Provided that $AR_k(a, b) - AR_{k-1}(a, b) \geq 0$ for all pairs, we prove that (8) holds for k + 1 as well:

$$AR_{k+1}(a, b) - AR_k(a, b) = \left( \gamma \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} AR_k(I_i(a), I_j(b)) + (1 - \gamma) \frac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} AR_k(O_i(a), O_j(b)) \right) - \left( \gamma \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} AR_{k-1}(I_i(a), I_j(b)) + (1 - \gamma) \frac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} AR_{k-1}(O_i(a), O_j(b)) \right)$$

$$= \gamma \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} \left( AR_k(I_i(a), I_j(b)) - AR_{k-1}(I_i(a), I_j(b)) \right) + (1 - \gamma) \frac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} \left( AR_k(O_i(a), O_j(b)) - AR_{k-1}(O_i(a), O_j(b)) \right)$$

By the induction hypothesis, $AR_k(I_i(a), I_j(b)) - AR_{k-1}(I_i(a), I_j(b)) \geq 0$ and $AR_k(O_i(a), O_j(b)) - AR_{k-1}(O_i(a), O_j(b)) \geq 0$. Thus, $AR_{k+1}(a, b) - AR_k(a, b) \geq 0$.

Lemma 2. $AS(a, b) \leq 1$.

Proof. Induction base: According to the Additive-SimRank definition, $AR_0(a, b) \leq 1$.

Inductive step: Provided that $AR_k(a, b) \leq 1$ for all pairs, we prove that $AR_{k+1}(a, b) \leq 1$. Using $AR_k \leq 1$ and $c < 1$,

$$AR_{k+1}(a, b) = \gamma \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} AR_k(I_i(a), I_j(b)) + (1 - \gamma) \frac{c}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} AR_k(O_i(a), O_j(b))$$

$$\leq \gamma \frac{1}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} 1 + (1 - \gamma) \frac{1}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} 1 = \gamma + (1 - \gamma) = 1$$

Thus, $AS(a, b) \leq 1$.

Theorem 1. AS(a, b) converges to a fixed value.

Proof. According to Lemma 1, $AR_k(a, b)$ is a monotonically non-decreasing sequence of non-negative terms. According to Lemma 2, $AR_k(a, b)$ has an upper bound. Thus, AS(a, b) converges to a fixed value.

In the same way, the Multiplicative-SimRank similarity MS(a, b) for any node pair (a, b) also converges to a fixed value.

5.2 Dealing with Loops in the network

The other problem of some information networks is that they can contain a number of cycles or loops. For example, food webs contain frequent cannibalism, which induces loops (e.g., salamander in Figure 1). In the dry season, 14% of the salamanders' food comes from killing other salamanders. Another example is that of steatoda spiders and latrodectus spiders, which eat each other. Table 2 shows the number of cycles in real-world food webs [3]. As we can see, cycles are quite common in food webs.

[Table 2: Statistics of the food web data sets. Columns: Data set, Vertex, Edge, Cycles. Rows: CYPWET, CYPDRY, BAYWET, BAYDRY, MANGWET, MANGDRY, GRAMWET, GRAMDRY.]

[Fig. 2: Representative graph with a loop/cycle. Node labels: s_1, s_n, s_m; path lengths: k, k.]

However, these cycles in the food web graph affect the species similarity scores. Let us look at the similarity scores of two pairs of species: fishing spider and salamander, and fishing spider and apple snail.
Table 3 tabulates these similarity scores for Figure 1. As we can see, S(fishing spider, salamander) is slightly higher than S(fishing spider, apple snail). However, in biology, fishing spider and apple snail are both classified as macro invertebrates, whereas salamander is classified as herpetofauna; fishing spider and salamander are not in the same classification. Similarly, other information networks, such as web page graphs and paper citation graphs, also have cycles. For example, in a citation graph, the same author can write two papers that cross-reference each other. We can prove the following theorem for the similarity calculation in the presence of loops in a graph.

[Table 3: Additive-SimRank results for Figure 1 (with cycles) (γ = 0.75, c = 0.8)]

[Table 4: Additive-SimRank results for Figure 1 (no cycles) (γ = 0.75, c = 0.8)]

Theorem 2. Consider a graph G with a cycle l and a line q, as shown in Figure 2. Let $l(s_n, s_m)$ denote a sequence of cycle vertices $s_n, \ldots, s_{i+1}, \ldots, s_m$. Let $q(s_1, s_n)$ denote a sequence of line vertices $s_1, \ldots, s_{i+1}, \ldots, s_n$, where $s_n$ is the crossing point between cycle l and line q. Let length(p) denote the length of path p, and let length(l) = length(q) = k. Then $S(s_1, s_n) = c^k$ and $S(s_n, s_m) = 0$.

Proof. (i) According to the SimRank definition [4],

$$S(s_1, s_n) = \sum_{t:(s_1, s_n) \rightsquigarrow (x, x)} P[t] \, c^{h(t)} = c \sum_{h(t)=1} P[t] + c^2 \sum_{h(t)=2} P[t] + \cdots + c^k \sum_{h(t)=k} P[t] + \cdots$$

where $h(t)$ denotes the number of steps of tour t. According to the definition of G, length(l) = length(q) = k. Thus, if two surfers start from $s_1$ and $s_n$, they meet at $s_n$ after k steps and then stop. Hence $S(s_1, s_n) = c^k \sum_{h(t)=k} P[t]$. Graph G contains only one cycle and one line, so there is only one path from $s_1$ to $s_n$ along the line and one path from $s_n$ back to $s_n$ along the cycle. Every node $w_i$ on the first surfer's path and every node $w_j$ on the second surfer's path therefore has a single choice at each step, so

$$P[t] = \prod_{i} \frac{1}{|I(w_i)|} \prod_{j} \frac{1}{|I(w_j)|} = 1 \cdot 1 = 1$$

Thus, $S(s_1, s_n) = c^k$.

(ii) $s_n$ and $s_m$ are two nodes in the cycle. If two surfers start from $s_n$ and $s_m$, they never meet at any point of the cycle. Thus, $S(s_n, s_m) = 0$.

This theorem provides two insights into SimRank scores and why they are not intuitively right for networks that contain loops. First, $s_1$ is at the bottom of the food web in Figure 2; in normal cases it is a primary species, such as periphyton or utricularia, whereas $s_n$ is a top consumer, such as bobcat or panther.
However, according to Theorem 2, these two species $s_1$ and $s_n$ have a high similarity to each other. Second, $s_m$ is another species in the cycle. In this food web graph, it is also a top-level consumer. However, according to Theorem 2, the pair $s_1$ and $s_n$ has a higher similarity score than the pair $s_m$ and $s_n$. This implies that bobcat and periphyton are more similar than bobcat and panther, which does not match our intuition. This example exposes the problem with SimRank scores. In fact, the same problem also exists in Additive-SimRank and Multiplicative-SimRank. Thus, before we calculate the similarity scores on the food web, we delete all the relationships that belong to a cycle. Table 4 shows the similarity results when cycles are deleted from the food web. The similarity score of the pair fishing spider and salamander is equal to 0, and in fact those two species are not in the same classification. Clearly, this result matches our intuition better.
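One way to carry out the preprocessing step described above is sketched here. The paper does not specify how cycles are detected, so this is an assumption-laden illustration: it interprets "deleting all the relationships in the cycle" as removing every edge that lies on some directed cycle, and it uses a simple reachability test rather than an optimized algorithm. The edge list is hypothetical.

```python
def remove_cycle_edges(edges):
    """Drop every edge that lies on a directed cycle (self-loops included).

    An edge (u, v) lies on a cycle exactly when u is reachable from v.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable(src, dst):
        # Depth-first search from src, stopping as soon as dst is found.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in succ[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        return False

    return [(u, v) for u, v in edges if not reachable(v, u)]


# Hypothetical predator -> prey edges with a cannibal and a mutual pair.
edges = [
    ("gruiformes", "salamander"),
    ("salamander", "salamander"),     # cannibalism: self-loop
    ("steatoda", "latrodectus"),      # the two spiders eat each other
    ("latrodectus", "steatoda"),
    ("salamander", "small frogs"),
]
print(remove_cycle_edges(edges))
```

The reachability test runs once per edge, which is adequate for food webs of this size; a strongly-connected-components pass would give the same result faster on larger graphs.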

6 EXPERIMENTAL EVALUATION

Data Sets: Our experiments use the data sets shown in Table 2. Please refer to [3] for details regarding these data sets. Before we calculate the similarity scores, we delete all the cycles in these data sets. The eight data sets come from four areas. The CYPWET and CYPDRY data sets are collected from the 295,000 hectare wetlands of the Big Cypress natural preserve in southwest Florida. The BAYWET and BAYDRY data sets are collected from a triangular, tropical lagoon/bay. The MANGWET and MANGDRY data sets are from the huge mangrove belt along the seaward edge of the Everglades. The GRAMWET and GRAMDRY data sets are from the historical Everglades system. In each area, the food web data is collected for different seasons. For example, CYPWET is collected in the wet season and CYPDRY in the dry season.

[Table 5: Classification of the food web data sets. Columns: Data Set, C.1, C.2, C.3, C.4, C.5, C.6, C.7, C.8. Rows: CYPWET, CYPDRY, BAYWET, BAYDRY, MANGWET, MANGDRY, GRAMWET, GRAMDRY.]

Note: The species in these data sets have been divided into eight classes by their different roles in the ecosystem, namely primary producers, micro fauna, mammals, macro invertebrates, herpetofauna, fishes, detritus and avifauna, which are marked from C.1 to C.8.

Table 5 shows that the species in these data sets are manually divided into eight classes. These classes are used as the baseline to evaluate the accuracy of our algorithms. All our experiments are conducted on a PC with a 3.0 GHz Intel Core 2 Duo processor and 2 GB memory, running Windows XP Professional. All algorithms are implemented in Java.

6.1 Evaluation Metric

In our food web data sets, there are predefined class labels for the species. For a species in the food web, the algorithms return a ranked list of related species. For each species in the list, if its label is the same as that of the query species $s_1$, we consider the two species closely related and assign grade 2 (to stress the related species); otherwise we assign grade 0.
Then, we use the normalized discounted cumulative gain (NDCG) [5] to evaluate the quality of this similarity ranking list. NDCG follows one principle: a species at a lower ranking position is less valuable for the researcher, because researchers care most about the species most related to species $s_1$. According to this principle, the NDCG score of a similarity ranking list at position n is calculated as follows:

$$N(n) = Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log(1 + j)}$$

where $r(j)$ is the rating of the j-th species in the similarity ranked list and the normalization constant $Z_n$ is chosen so that a perfect order gets an NDCG value of 1. For example, consider the NDCG@10 score for the species Living sediment in data set CYPWET. Because there is only one other species in the micro fauna classification, the perfect order used to determine $Z_n$ is 2,0,0,0,0,0,0,0,0,0. We calculate NDCG over the top 10 related species for each species in each data set and use the average score to evaluate the validity of our experiments.
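The NDCG formula above can be sketched in a few lines. This is a minimal illustration with two stated assumptions: the ideal order used for $Z_n$ is obtained by sorting the ratings of the given list in descending order, and a list with no related species is given a score of 0. The base of the logarithm does not matter because it cancels in the normalization.

```python
import math


def dcg(ratings, n):
    """Sum of (2^r(j) - 1) / log(1 + j) over the first n positions (j from 1)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:n], start=1))


def ndcg(ratings, n):
    """DCG divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(ratings, reverse=True), n)
    return dcg(ratings, n) / ideal if ideal > 0 else 0.0


# Grades as in section 6.1: 2 for a species of the same class, 0 otherwise.
perfect = [2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
shifted = [0, 2, 0, 0, 0, 0, 0, 0, 0, 0]
print(ndcg(perfect, 10))   # related species ranked first
print(ndcg(shifted, 10))   # related species ranked second, so a lower score
```

Moving the single related species from rank 1 to rank 2 lowers the score to log(2)/log(3), which shows how strongly the metric rewards placing related species at the top of the list.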


6.2 Experimental Results

Parameter Study: Two parameters, c and γ, directly affect the accuracy of the similarity scores. Both are application dependent, so we study suitable values for the food web data. First, we discuss the parameter γ for Additive-SimRank. This parameter decides the relative importance of the two relationships, is-preyed and prey, for accuracy. In this experiment, we fix the damping factor c to 0.8 and vary γ from 0 to 1. Figure 3 shows that Additive-SimRank achieves its highest accuracy when γ is equal to 0.75. Interestingly, the is-preyed relationship is much more important than the prey relationship for deciding the species classification. Second, we determine the damping factor c for the three link-based similarity algorithms. In this experiment, we fix γ = 0.75 and vary c, starting from 0.05. In fact, the effect of the damping factor is not very pronounced. Figure 4 shows that Additive-SimRank, SimRank and Multiplicative-SimRank achieve their highest scores at c = 0.8, 0.1 and 0.95, respectively. Thus, for the rest of the experiments, γ is set to 0.75 for Additive-SimRank and c is set to 0.8, 0.1 and 0.95 for Additive-SimRank, SimRank and Multiplicative-SimRank, respectively.

[Fig. 3: Parameter γ for Additive-SimRank]

[Fig. 4: Parameter c]

Accuracy Analysis: In these experiments, we compare the accuracy of Multiplicative-Jaccard [14], Additive-Jaccard [14], Multiplicative-SimRank, Additive-SimRank and SimRank. Using the additive and multiplicative rules, it is easy to design the Multiplicative-Jaccard and Additive-Jaccard algorithms. Figure 5 shows the accuracy of these five methods on the eight food web data sets. We can see that Multiplicative-Jaccard and Multiplicative-SimRank have the lowest accuracy. Because SimRank and Additive-SimRank consider the potential (indirect) linkage information, these two algorithms are much better than the Additive-Jaccard algorithm. Because Additive-SimRank considers both the is-preyed and prey relationships, it reaches the best accuracy. Figure 6 plots the results of NDCG@1 to NDCG@19 for each algorithm.

[Fig. 5: Segmentation of CYPWET data set [3]]

[Fig. 6: NDCG@1 to NDCG@19]

In Table 6, SimRank shows a better accuracy for two food webs: GRAMWET and GRAMDRY. This is primarily because we used the same γ for all of the data sets although it was derived for the CYPWET data set. When we use the γ derived for these two data sets, the results are better for Additive-SimRank on them as well.

As a case study, we analyze the top ten similar species for the species Roots in the CYPWET food web. Because Roots is a primary producer, it only has the is-preyed relationship. The result is shown in Table 6. Because the multiplicative method is the product of the similarity scores of the two relationships, Multiplicative-Jaccard and Multiplicative-SimRank cannot produce any similar species for Roots. Additive-Jaccard only considers direct relationships, so it finds only eight species for Roots, none of which are primary producers. The result of SimRank, containing seven primary producers, is also very good, but because Additive-SimRank considers both the is-preyed and prey relationships, it finds eight primary producers, slightly more than SimRank.

Table 6: Case study for species Roots

Multi.-Jaccard   Additive-Jaccard        Multi.-SimRank   Additive-SimRank    SimRank
Null             Apple Snail             Null             Cypress Wood        Cypress Wood
Null             Crayfish                Null             HW Wood             HW Wood
Null             Prawn                   Null             Vine Leaves         Vine Leaves
Null             Aquatic Invertebrates   Null             Cypress Leaves      Cypress Leaves
Null             Vertebrate Det.         Null             Vertebrate Det.     Epiphytes
Null             Ter. Invertebrates      Null             Epiphytes           Vertebrate Det.
Null             Refractory Det.         Null             Float. vegetation   Float. vegetation
Null             Labile Det.             Null             Macrophytes         Macrophytes
Null             Null                    Null             Phytoplankton       Living POC
Null             Null                    Null             Living POC          Living sediment

7 CONCLUSIONS

In this paper, considering both the prey (out-link) and is-preyed (in-link) relationships in the food web, we propose Additive- and Multiplicative-SimRank to calculate the similarity scores. We also discuss the loop problem in the network and propose a method to address it.
The experimental results on eight food web data sets show that Additive-SimRank outperforms the other approaches, achieving its highest score on the food web with γ equal to 0.75. In addition, our methods are also applicable to other information networks with similar characteristics, such as paper citation networks and web page networks.

References

1. Y. Cai, G. Cong, X. Jia, H. Liu, J. He, J. Lu, and X. Du. Efficient algorithm for computing link-based similarity in real world networks. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining.
2. D. Fogaras and B. Rácz. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web.
3. L. J. Gross. South Florida ecosystems. atlss/atlss.html.
4. G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
5. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, October.
6. M. M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14(1):10–25, April.
7. A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Mathematics, 1(3), 2004.
8. D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. The VLDB Journal: The International Journal on Very Large Data Bases, 19(1):45–66, February.
9. L. Lovász. Random walks on graphs: A survey. Bolyai Society Mathematical Studies, 2:1–46, February.
10. N. D. Martinez. Artifacts or attributes? Effects of resolution on the Little Rock Lake food web. Ecological Monographs, 61(4), December.
11. N. D. Martinez. Effect of scale on food web structure. Science, 260(5105), April.
12. H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 2(0):28–31, February.
13. X. Yin, J. Han, and P. S. Yu. LinkClus: efficient clustering via heterogeneous semantic links. In Proceedings of the 32nd International Conference on Very Large Data Bases.
14. P. Yodzis and K. O. Winemiller. In search of operational trophospecies in a tropical aquatic food web. Oikos, 87(0), February 1999.
