HN-Sim: A Structural Similarity Measure over Object-Behavior Networks

Jiazhen Nian, Shanshan Wang, and Yan Zhang
Department of Machine Intelligence, Peking University
Key Laboratory on Machine Perception, Ministry of Education
Beijing, P.R. China
{nian,shanshanwang}@pku.edu.cn, zhy@cis.pku.edu.cn

Abstract. Similarity measurement is critical for many applications, such as text analysis, link prediction and recommendation. However, existing work focuses on content and rarely involves structural features. Even fewer methods are applicable to heterogeneous networks, which are prevalent in the real world, for example bibliographic information networks. To address this problem, we propose a new similarity measure from the perspective of heterogeneous structure. Heterogeneous neighborhoods are used to instantiate the topological features and to categorize the related nodes in the graph model. We compare our measure with several traditional ones on real data from DBLP and Flickr. Manual evaluation shows that our method outperforms the traditional ones.

Keywords: Structural similarity measurement, Heterogeneous network, Object-behavior network.

1 Introduction

Similarity measurement is an important and fundamental problem for many practical information retrieval tasks, such as relevance search [2], clustering, and even ontology generation and integration [9]. It evaluates the similarity between objects in relation networks. For relevance search, content-based models are commonly used, with bag-of-words as an instance. Semantic information is learned from the words, and relations between texts are extracted by comparing their contents [3]. For instance, when we measure the similarity between two authors in a bibliographic information network, the co-author and conference-participation behaviors can be extracted to characterize the similarity. The method goes beyond linguistic analysis, but it still works effectively.
Although these state-of-the-art methods perform well, other features can be borrowed to further improve similarity measurement. Take the popularity of social networks as an example: behavioral characteristics and social relationships often reflect the similarity between users. To study the behavior of objects, a link-based graph model, or object-behavior network, is usually established, in which behavioral objects are the nodes and a particular behavior or relationship becomes a link between them. Hence, the analysis of object behavior is translated into the analysis of this object-behavior network, and new similarity measures can be proposed based on structural analysis over object-behavior networks.

H. Motoda et al. (Eds.): ADMA 2013, Part I, LNAI 8346. © Springer-Verlag Berlin Heidelberg 2013

In the graph model, some homogeneous structural features, such as common neighbors and link networks [7], have been studied extensively. However, in many networks the objects and relations are of various types; such networks are called heterogeneous information networks. In this case, conventional methods do not work well, and researchers have proposed new similarity measures such as PathSim [11] and HeteSim [10].

In this paper, we tackle the structural similarity measurement problem at a general level. Every object in the network is categorized according to its type and attributes. An object-behavior network is set up, and nodes are linked by heterogeneous edges even if some of them are not directly connected. We call these connected nodes heterogeneous neighbors. For a particular object, all of its heterogeneous neighbors that are connected by the same relationship constitute a heterogeneous neighborhood. The similarity measure is then built on these heterogeneous neighborhoods. This method is also efficient and cost-saving, because it takes into account only a very small part of the data. We conduct experiments on the DBLP data set, calculating the similarity between authors. In addition, a manual evaluation on Flickr is made to compare our method with conventional measures.
Our contributions can be summarized as follows. We study structural similarity beyond semantic approaches, and the method can be applied to measure similarity efficiently in object-behavior networks. We also propose heterogeneous neighborhoods to categorize different neighbor objects and to extract the semantic information of the relationships.

The remainder of the paper is organized as follows. We review related work in Section 2. The details of the similarity measurement method are described in Section 3. The experimental results are shown in Section 4. Finally, we conclude our work in Section 5.

2 Related Work

The problem of similarity search between structured objects has been studied in the domains of structural pattern recognition and pattern analysis. The most common approach in previous work is based on the comparison of structure: two objects are considered structurally equivalent if they share many common neighbors. In contrast to homogeneous networks, neighbors in heterogeneous information networks cannot be treated as identical because of the multi-typed relations.

To utilize structural information in heterogeneous information networks, Sun et al. proposed PathSim [11], which measures the similarity of same-typed objects based on symmetric paths by considering the semantics of meta-paths, which are constituted by different-typed objects. Shi et al. proposed another measure, called HeteSim [10], based on the idea that the relatedness of an object pair is defined according to the search path that connects the two objects through a sequence of nodes.

Some similarity measures based on link structure, such as SimRank [5] and P-Rank (Penetrating Rank) [13], are also effective. These methods consider two vertices similar if their immediate neighbors in the network are similar. The difference is that SimRank only calculates similarity between same-typed vertices, and only partial structural information from the in-link direction is considered during similarity computation. P-Rank enriches SimRank by jointly encoding both in-link and out-link relationships into the structural similarity computation. Although P-Rank considers heterogeneous relations between vertices, only directly linked neighbors are included in the computation, so some semantic information is lost. Furthermore, both SimRank and P-Rank are iterative algorithms, so the cost becomes heavier as the network grows.

3 Methodology

3.1 Notation and Definition

To formalize our method, we first give some concepts.

Definition 1. Network. A network is defined as a directed graph, denoted $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. In an object-behavior network, every object is abstracted as a vertex, and if two objects are related by a particular behavior or relationship, they are linked by an edge.

Definition 2. Node. A node stands for an object in an object-behavior network, denoted $v \in V$.
It is represented as a triple $\langle info, type, \Phi \rangle$, where $info$ and $type$ denote the information of this object and $\Phi$ is the general set of neighborhoods of node $v$. We refer to $info$, $type$ and $\Phi$ of $v$ as $v.info$, $v.type$ and $v.\Phi$.

Definition 3. Heterogeneous Neighborhood. A heterogeneous neighborhood is a set of nodes that describes one kind of topological structural feature of an object in an object-behavior network. We formalize it as an infinite set $\phi$. Each $\phi$ is defined as a triple $\langle relation, distance, \nu \rangle$, where $relation$ is the description of the relationship and $distance$ is the length of the path from the focal object to each node $v$ in the node set $\nu$.

Each type of link represents a kind of relation, and each relation is denoted $R \in \mathcal{R}$. The function $R(v_p) = \{ v \mid v \in V,\ v \text{ and } v_p \text{ are connected by relation } R \}$ gives the nodes connected to $v_p$ by $R$. In Definition 3, suppose a heterogeneous neighborhood $\phi \in v_p.\Phi$ is associated with relation $R_m$. For simplicity, we use the relation name as a subscript to identify the neighborhood: $\phi_R$ denotes the $\phi$ whose relation is $R$, and it follows that $\phi_{R_m}.\nu = R_m(v_p)$.
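As a rough illustration of Definition 3, the sketch below computes the node set $\nu$ of a heterogeneous neighborhood by following a relation path over a small typed graph. The toy graph, the two-letter relation names ("PA", "PC"), and the helpers `build_index` and `neighborhood` are assumptions made for this example only; the paper does not prescribe a data structure, and excluding the start node from its own neighborhood is also an assumption.

```python
from collections import defaultdict

# Hypothetical toy bibliographic graph, loosely modelled on the Paper 1 example.
# Each key names a single-hop relation by its endpoint types (P=paper, A=author, C=conference).
EDGES = {
    "PA": [("Paper1", "Author1"), ("Paper1", "Author2"), ("Paper2", "Author2")],
    "PC": [("Paper1", "Conf1"), ("Paper2", "Conf1")],
}


def build_index(edges):
    """Index every relation in both directions, e.g. "PA" and its reverse "AP"."""
    index = defaultdict(lambda: defaultdict(set))
    for name, pairs in edges.items():
        for src, dst in pairs:
            index[name][src].add(dst)
            index[name[::-1]][dst].add(src)
    return index


def neighborhood(index, node, path):
    """Return the node set (phi.nu) reached from `node` along a relation path.

    A path such as "PAP" is read as consecutive single-hop relations
    ("PA" then "AP"); its distance is the number of hops.
    The start node itself is excluded (an assumption of this sketch).
    """
    frontier = {node}
    for hop in range(len(path) - 1):
        relation = path[hop:hop + 2]
        frontier = {nbr for cur in frontier for nbr in index[relation][cur]}
    return frontier - {node}


index = build_index(EDGES)
phi_pa = neighborhood(index, "Paper1", "PA")    # distance 1
phi_pap = neighborhood(index, "Paper1", "PAP")  # distance 2
```

Grouping the resulting sets by path length would give the per-distance neighborhood sets used later in the similarity formula.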

Fig. 1. (a) A heterogeneous information network; (b) a schematic diagram of the relations in a heterogeneous network

In this paper, distance plays a very important role. Hence, for clarity, we define one more notation: for a particular node $v$, every $\phi \in v.\Phi$ is labeled by $\phi.distance$. Each neighborhood set $\Phi_k$ is defined as $\{\phi_i \mid \phi_i \in \Phi,\ \phi_i.distance = k\}$, where $\Phi = \bigcup_{k=1}^{n} \Phi_k$.

A heterogeneous neighborhood contains nodes of the same type that are connected by the same relation. In homogeneous networks, the relation is simply linked or not linked. Here, a relation is expressed by a sequence of related-object pairs. For instance, in Fig. 1(a), Author 2 is a homogeneous neighbor of Paper 1 because there is an edge between them. Paper 2 is a heterogeneous neighbor of Paper 1 because they can be connected by (Paper 1, Author 2) and (Author 2, Paper 2). Fig. 1(b) shows all the feasible relations in that heterogeneous information network.

Example 1. Heterogeneous neighborhood of Paper 1 (in Fig. 1(a)). Node Paper 1: $\langle \text{Paper 1}, \text{paper}, \Phi \rangle$, with $\Phi = \bigcup_{k=1}^{3} \Phi_k$ and $\Phi_1 = \{\phi_0, \phi_1, \phi_2, \phi_3\}$. Here $\phi_0.relation$ is "be written", $\phi_1.relation$ is "be published in", etc. We use P, A, C and T to represent paper, author, conference and term respectively; the relation $R_{bewritten}$ can then be written as PA. For simplicity and clarity, we use $\phi_{PA}$ instead of $\phi_0$. Hence, $\phi_{PA}.distance$ is 1 and $\phi_{PA}.\nu$ is $\{\text{Author 1}, \text{Author 2}\}$.

3.2 Similarity Measure Based on Heterogeneous Neighborhood

Homogeneous Similarity Measure Review. The principle of establishing an information network is to connect relevant objects. For instance, two authors are related because they once collaborated on a paper, and Hawaii is connected to CIKM in a bibliographic network because CIKM was once held in Hawaii. Consequently, if two authors share many common neighbors, such as papers, they are likely to have common research interests. We therefore believe that structural features reveal the similarity between nodes in an information network, and that similar nodes are strongly connected to each other. A variety of similarity measures based on structural features in homogeneous networks have been proposed. Here we review some of the classical ones.

Personalized PageRank (PR): Personalized PageRank [1] is an extension of PageRank for personalization that introduces a preference set $P$. The Personalized PageRank equation is defined as $v = (1 - c)Av + cu$.

Common neighbors (CN): Common neighbors [1] is defined in the graph model as the number of common neighbors shared by two vertices $v_i$ and $v_j$, namely $|\Gamma(v_i) \cap \Gamma(v_j)|$. A larger value means the two objects are more similar.

Jaccard's coefficient (JC): Jaccard's coefficient is another measure of the similarity between two vertices $v_i$ and $v_j$; it is the normalized value of common neighbors, namely $\frac{|\Gamma(v_i) \cap \Gamma(v_j)|}{|\Gamma(v_i) \cup \Gamma(v_j)|}$.

HN-Sim Formula. However, neighbors in different neighborhoods or of diverse types carry different meanings. Taking DBLP as an example, an institute and a conference can be neighbors of the same papers, but it is hard to tell whether they are similar. By analyzing the composition of the neighbors, we can measure the similarity between two nodes in an information network. The diversity of neighbors provides different semantic information about nodes, and neighbors at different levels represent different properties. In this paper, heterogeneous neighborhoods are proposed to categorize the neighbors. Accordingly, the similarity of two objects is measured in each heterogeneous neighborhood separately and finally merged by an influence-based function. As discussed in Section 3.1, a node's heterogeneous neighbors can be denoted by sequences of object pairs.
Because the sequence length is unbounded, $\Phi$ can be infinite, which causes two problems. On the one hand, computing too many neighbors may lead to overfitting, so a strategy is needed to decide which neighborhoods should be considered. On the other hand, as the distance of a neighborhood grows, the objects at that neighbor level are increasingly likely to be repetitions. In response, experiments and intuition suggest that the length of the path used should be positively correlated with the diversity of the system and negatively correlated with the scale of the node set.

Meanwhile, since content is not used, neighborhoods at the same distance have the same influence in the similarity measurement. Although these neighborhoods have different semantic meanings, we can average their similarity contributions in terms of their influence. Therefore, the problem of synthesizing the similarity of individual heterogeneous neighborhoods is simplified to synthesizing the similarity of each neighborhood set.
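For reference, the three homogeneous baselines reviewed in this section (CN, JC and Personalized PageRank) can be sketched in a few lines. This is only an illustrative sketch: the adjacency representation (a dict of neighbor sets), the function names, the teleport value `c=0.15`, the uniform preference vector over the preference set, and the fixed iteration count are all assumptions, not settings reported in the paper.

```python
def common_neighbors(adj, a, b):
    """CN: number of neighbors shared by a and b."""
    return len(adj[a] & adj[b])


def jaccard_coefficient(adj, a, b):
    """JC: shared neighbors divided by the size of the union of both neighbor sets."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0


def personalized_pagerank(adj, preference, c=0.15, iterations=50):
    """Iterate v = (1 - c) * A v + c * u, where A spreads each node's score
    evenly over its neighbors and u is uniform over the preference set.
    Nodes without neighbors simply drop their mass in this sketch."""
    nodes = list(adj)
    u = {n: (1.0 / len(preference) if n in preference else 0.0) for n in nodes}
    v = dict(u)
    for _ in range(iterations):
        nxt = {n: c * u[n] for n in nodes}
        for src in nodes:
            if adj[src]:
                share = (1 - c) * v[src] / len(adj[src])
                for dst in adj[src]:
                    nxt[dst] += share
        v = nxt
    return v


# Hypothetical undirected toy graph.
adj = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
cn = common_neighbors(adj, "a", "b")
jc = jaccard_coefficient(adj, "a", "b")
scores = personalized_pagerank(adj, preference={"a"})
```

All three treat every neighbor alike, which is the limitation that the heterogeneous neighborhoods are meant to address.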

The similarity between nodes $v_1$ and $v_2$ is defined as Formula 1.

$$HeteroSim(v_1, v_2) = \frac{1}{1+\varepsilon} \prod_{k:\, \Phi_k \subseteq v_1.\Phi} \left\{ \varepsilon + L\left[ \frac{1}{|\Phi_k|} \sum_{R_i:\, \phi_i \in \Phi_k} \theta_{R_i}(v_1, v_2) \right] \right\} \tag{1}$$

Here the function $L[\cdot]$ controls the influence of each level of neighbors. In a path-driven model, $L[\cdot]$ can be defined as a power function of the distance:

$$L[x] = [x]^{\delta} \tag{2}$$

Here $\delta$ is decided by the length and node type of the heterogeneous neighborhood. Intuitively, more distant nodes have less influence on an object. Thus, $\delta$ should be positive and positively related to $\phi_{R_i}.distance$. The smoothing factor $\varepsilon$ is employed for the case in which some heterogeneous neighborhoods contain no common neighbors. The function $\theta_R(\cdot)$ measures the similarity between two nodes based on the particular neighborhood connected by relation $R$. In the experiments, Jaccard's coefficient is used to calculate it:

$$\theta_{R_i}(v_1, v_2) = \frac{|\Gamma_{R_i}(v_1) \cap \Gamma_{R_i}(v_2)|}{|\Gamma_{R_i}(v_1) \cup \Gamma_{R_i}(v_2)|} \tag{3}$$

where $\Gamma_{R_i}(v)$ is the node set of $\phi_i$, namely $\phi_i.\nu$.

Homo-Info Adjustment. In some homogeneous networks, node quality and influence play important roles in measuring the similarity between two entities. Taking a social network as an example, a person with a good reputation tends to be similar to other people with a similarly good reputation. In this step, we therefore extract homogeneous information to adjust the heterogeneous similarity. Wang et al. have proposed several methods that use influence to enhance similarity measurement [12]. We extend the idea of influence ranking in homogeneous networks as HomoSim to adjust the HN-Sim model. The similarity between nodes $v_1$ and $v_2$ is defined as Formula 4.

$$Sim(v_1, v_2) = HomoSim(v_1, v_2)^{\lambda} \cdot HeteroSim(v_1, v_2)^{1-\lambda} \tag{4}$$

Here HomoSim is the homogeneous ranking similarity between the two nodes, and HeteroSim is the basic HN-Sim.
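To make Formulas 1 to 3 concrete, here is a minimal sketch of the basic heterogeneous similarity. It rests on several assumptions that should be checked against the original typeset formula: the outer aggregation over distance levels is taken to be a product, $\delta$ is set to the distance $k$ of each level (following the experimental setting described in Section 4), and the smoothing value `epsilon=0.01` is purely illustrative. The input layout (distance, then relation name, then node set) and the function names are hypothetical.

```python
from math import prod


def jaccard(set_a, set_b):
    """Formula 3: Jaccard's coefficient of two neighbor sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def hetero_sim(nbhd_1, nbhd_2, epsilon=0.01):
    """Sketch of Formula 1.

    nbhd_x maps distance k -> {relation name: set of neighbor nodes}.
    For each distance level, the Jaccard scores of its relations are averaged,
    passed through L[x] = x ** delta with delta = k (assumption), smoothed by
    epsilon, and the levels are multiplied together (assumption).
    """
    factors = []
    for k, relations in nbhd_1.items():
        if not relations:
            continue  # skip empty levels to avoid dividing by zero
        thetas = [
            jaccard(nodes, nbhd_2.get(k, {}).get(rel, set()))
            for rel, nodes in relations.items()
        ]
        mean_theta = sum(thetas) / len(thetas)
        factors.append(epsilon + mean_theta ** k)
    return prod(factors) / (1 + epsilon)


# Hypothetical neighborhoods of two papers.
paper_x = {1: {"PA": {"a1", "a2"}}, 2: {"PAP": {"p2", "p3"}, "PCP": {"p2"}}}
paper_y = {1: {"PA": {"a2"}}, 2: {"PAP": {"p3"}, "PCP": {"p2"}}}
score = hetero_sim(paper_x, paper_y)
```

Because of the smoothing term, a pair with no common neighbors at one level still receives a small positive score instead of zero.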
By adjusting the weighting parameter $\lambda$, we can draw the following conclusions: (1) when $\lambda = 0$, HN-Sim computes only the heterogeneous-neighborhood-based similarity; (2) when $\lambda = 1$, HN-Sim reduces to the homogeneous ranking similarity; (3) when $\lambda$ is between 0 and 1, it balances the homogeneous ranking bias against the basic HN-Sim. This is discussed further in Section 4.4.

HomoSim is an adjustment function to enhance HN-Sim. It can be estimated as Formula 5.

$$HomoSim(v_1, v_2) = 1 - [Rank(v_1) - Rank(v_2)]^2 \tag{5}$$
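The combination in Formulas 4 and 5 can be sketched as follows. The sketch reads Formula 5 as one minus the squared rank difference and assumes that rank scores are already normalized to the range [0, 1], so that HomoSim stays between 0 and 1; the function names and example values are hypothetical.

```python
def homo_sim(rank_1, rank_2):
    """Formula 5 (as read here): 1 minus the squared difference of two rank scores.
    Assumes both scores are normalized to [0, 1]."""
    return 1.0 - (rank_1 - rank_2) ** 2


def hn_sim(homo, hetero, lam):
    """Formula 4: weighted geometric combination of the two similarities."""
    return (homo ** lam) * (hetero ** (1.0 - lam))


homo = homo_sim(0.8, 0.5)          # two nodes with different rank scores
only_hetero = hn_sim(homo, 0.4, 0.0)  # lambda = 0 keeps only HeteroSim
only_homo = hn_sim(homo, 0.4, 1.0)    # lambda = 1 keeps only HomoSim
balanced = hn_sim(homo, 0.4, 0.4)
```

The geometric form means that a very low value of either component pulls the combined score down, with $\lambda$ controlling how strongly each one counts.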

The ranking function in Formula 5 needs to satisfy the following three properties: (i) global calculation; (ii) rapid convergence; (iii) little content involved. HN-Sim is based on local structure and is therefore seriously limited from a global view; at the same time, involving global calculation should not add much cost.

Different networks call for different quality-ranking strategies. In a bibliographic network, papers, authors and conferences are connected to each other; to find papers of similar quality, PageRank [8] might be used. In Wikipedia, there are few links between entities, but entities are linked to a number of editors. We believe that high-quality authors edit high-quality entities, and high-quality entities are edited by high-quality authors. Therefore, in Wikipedia networks, HITS [6] is used as the Rank() function. In addition, content-based ranking (CRank) is also suitable. For instance, in the IMDB data set, the review score can be taken as the parameter of the ranking function to measure the similarity of two movies.

4 Experiments

In this section, we present our experiments on a bibliographic information network and a social picture-sharing network.

4.1 Datasets

A bibliographic network is a typical heterogeneous information network and a kind of object-behavior network. We use our method to analyze the similarity between different authors. In this paper, we use the DBLP dataset downloaded in January 2013, which contains 869,113 papers, 689,177 authors and 1,304 conferences. Fig. 2 shows the number of neighbors in different $\Phi_i$. The majority of the nodes have fewer than 10 neighbors in $\Phi_1$, and more than 40% of the nodes have a heterogeneous neighborhood with more than 100 nodes. The diversity of heterogeneous neighborhoods is shown in Table 1. For instance, an author-type node has 18 heterogeneous neighborhoods whose distance is 5.
The other dataset is the one used in [11], downloaded from Flickr. Flickr is an image hosting website and online community that provides free and paid digital photo storage, sharing and other online social services. This information network contains images, users, tags and groups. The dataset covers 10,000 images from 20 groups, with 10,284 tags and 664 users.

In this experiment we limit the length of the related-pair sequence to 4, which means that only $\Phi_1$, $\Phi_2$ and $\Phi_3$ are calculated. In the DBLP dataset, the size of the neighborhood set $\Phi$ is 9. In Formula 2, $\delta$ is the exact length of the neighborhood. For example, the values of $\delta$ for $\theta_{R_{AP}}$ and $\theta_{R_{APC}}$ are 2 and 3, respectively.

Fig. 2. Cumulative probability distribution of the scale of heterogeneous neighbors at different distances: $\Phi_1$, $\Phi_2$ and $\Phi_3$

Table 1. The number of heterogeneous neighborhood types at different distances
Distance    Author    Conference    Paper

4.2 Case Study

In the DBLP dataset, we calculate the similarity between the author Christos Faloutsos and other authors. The top 10 similar authors, with their similarity scores, are listed in Table 2. We can see that HN-Sim satisfies self-similarity. The authors found by the HN-Sim model either publish similar papers or have strong connections with him and hold similar research interests. For instance, Philip S. Yu and Christos Faloutsos both work mainly in data mining and have similar reputations in this area. Faloutsos and Kleinberg have collaborated on high-level papers.

Table 2. Similar authors to Christos Faloutsos
Rank    Author    Similarity
1       Christos Faloutsos
2       Philip S. Yu
3       Jure Leskovec
4       Joseph M. Hellerstein
5       Yufei Tao
6       Divesh Srivastava
7       Jon M. Kleinberg
8       Petros Drineas
9       Yannis Manolopoulos
10      Gerhard Weikum

4.3 Result and Evaluation

To evaluate the effect of our method, we conduct another experiment on the Flickr data set. User, tag and category information and a brief description are given for each image. We extract the relations between images, tags and groups to construct a heterogeneous information network. I represents image, and T and G represent tag and group, respectively.

Fig. 3. Top 5 similar images in Flickr found by different methods

The relations in this object-behavior network can be composed from three initial relations: $R_{IT}$, tag is assigned to image; $R_{IG}$, image is categorized into group; $R_{TG}$, tag is assigned to group. To measure the similarity between two images, we limit the neighborhood distance to under 3. For example, $\Phi_2 = \{\phi_{ITI}, \phi_{IGI}, \phi_{ITG}, \phi_{IGT}\}$. $\delta$ in Formula 2 is set to the distance of each $\phi$. We use our method to find similar pictures. Common neighbors (CN), Jaccard's coefficient (JC) and Personalized PageRank (PR) are used as baselines. The results are shown in Fig. 3.

To evaluate the performance of the measure based on heterogeneous neighborhoods, we calculate the normalized Discounted Cumulative Gain (nDCG) [4] for each baseline method. nDCG is one of the most common ways to evaluate search result quality; the expression of DCG is as follows.

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \tag{6}$$

Here $p$ is the position of an image in a particular ranking made by volunteers, and $rel_i$ is the relevance value of image $i$. We invited 6 volunteers to rank 144 images by their similarity to the first image in Fig. 3. This ranking is defined as the ground truth. The relevance function $rel$ in the nDCG formula is defined in Formula 7.
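A small sketch of the DCG and nDCG computation may help here. It numbers positions from 1 so that $\log_2(i + 1)$ is never zero, and it takes the ideal ordering to be the relevance grades sorted in descending order, which is the usual convention and an assumption about how the gold-standard normalization is carried out. The function names and the sample grades are hypothetical.

```python
from math import log2


def dcg(relevances):
    """DCG over a ranked list of relevance grades, with positions starting at 1."""
    return sum(
        (2 ** rel - 1) / log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of the top 5 results returned by some method.
grades = [3, 2, 3, 0, 1]
quality = ndcg(grades)
```

A ranking that already lists items in descending order of relevance scores exactly 1, and any other ordering of the same grades scores lower.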

$$rel_i = relevance(|rank_i - goldrank_i|) \tag{7}$$

where $rank_i$ is the position of image $i$ as ranked by the experimental method and $goldrank_i$ is its position in the ground truth. The values are mapped to discrete numbers (Table 3).

Table 3. Relevance mapping
             [0,1]    [2,5)    [5,10)    [10,144]
relevance

For comparison, we normalize the $DCG_p$ of each result by dividing it by $IDCG_p$, which is calculated from the gold-standard ranking. Fig. 4 shows the performance of all the methods we use. Our method achieves the best performance. Homogeneous methods with structural features are not stable and do not perform well in a heterogeneous information network. This validates that structural similarity is significant and that it should be applied in a heterogeneous way.

Fig. 4. nDCG of each similarity measure on the Flickr dataset

4.4 Homo-Info Adjustment Discussion

Basic HN-Sim is based on local structure. It outperforms conventional methods in the experiments on the DBLP and Flickr datasets. As introduced in Section 3.2, basic HN-Sim can be enhanced by involving global computation. In this section, we integrate PageRank, CRank and HeteroPageRank (PageRank in a heterogeneous network) into HN-Sim. Fig. 5 provides the average results for different $\lambda$ in terms of 65 similar images on the Flickr data set. There is little difference among the other methods at nDCG@15, while basic HN-Sim performs outstandingly. However, we would like to point out that after nDCG@15, the integration of global information improves the results.

To demonstrate the effect of incorporating heterogeneous and homogeneous information, we sampled some movies from IMDB for another experiment. $\lambda$ is set to 0.4 and PageRank is used as the global homogeneous ranking. We calculate the most similar movies to Avatar; five volunteers were paid to rank the similar movies from a candidate set as the ground truth. The nDCG@20 is shown in Fig. 6. We notice that in this experiment the homogeneous information does not improve the top 10 results, though it does help the overall results.

Fig. 5. The experimental results in terms of nDCG: (a) λ = 0.2, (b) λ = 0.4, (c) λ = 0.6, (d) λ = 0.8

Fig. 6. nDCG on the IMDB dataset

5 Conclusions

In this paper, we propose a novel method to measure the similarity between objects from the perspective of structure. Heterogeneous neighborhoods are used to categorize heterogeneous neighbors, which contain more structural semantic information than initial neighbors. Compared with conventional methods, our HN-Sim measure puts more focus on topological features. The experimental results show that our method performs better. As future work, we will study the extraction of heterogeneous neighborhoods further and give a formalized pattern for choosing levels. The aggregation of the distance measure of each neighbor set may also be diversified to account for different data sets and the semantic information of the data.

Acknowledgements. This work was supported by NSFC with Grant No and , and 973 Program with Grant No.2014CB

References

1. Chakrabarti, S.: Dynamic personalized PageRank in entity-relation graphs. In: Proceedings of the 16th International Conference on World Wide Web. ACM (2007)
2. Hatzivassiloglou, V., Klavans, J.L., Eskin, E.: Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Citeseer (1999)
3. Islam, A., Inkpen, D.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD) 2(2), 10 (2008)
4. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20(4) (2002)
5. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2002)
6. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46(5) (1999)
7. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58(7) (2007)
8. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web (1999)
9. Ruotsalo, T., Hyvönen, E.: A method for determining ontology-based semantic relevance. In: Wagner, R., Revell, N., Pernul, G. (eds.) DEXA. LNCS, vol. 4653. Springer, Heidelberg (2007)
10. Shi, C., Kong, X., Yu, P.S., Xie, S., Wu, B.: Relevance search in heterogeneous networks. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM (2012)
11. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: PathSim: Meta path-based top-k similarity search in heterogeneous information networks. In: VLDB 2011 (2011)
12. Wang, G., Hu, Q., Yu, P.S.: Influence and similarity on heterogeneous networks. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM (2012)
13. Zhao, P., Han, J., Sun, Y.: P-Rank: a comprehensive structural similarity measure over information networks. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM (2009)
