
Link Label Prediction in Signed Citation Network

Thesis by Uchenna Akujuobi

In Partial Fulfillment of the Requirements for the Degree of Master of Science

King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia

March 2016

The thesis of Uchenna Akujuobi is approved by the examination committee.

Committee Chairperson: Prof. Xiangliang Zhang
Committee Member: Prof. Mikhail Moshkov
Committee Member: Prof. Xin Gao

King Abdullah University of Science and Technology
2016

Copyright 2016 Uchenna Akujuobi. All Rights Reserved.

Link Label Prediction in Signed Citation Network

Uchenna Akujuobi

Abstract

Link label prediction is the problem of predicting the missing labels or signs of all the unlabeled edges in a network. For signed networks, these labels can be either positive or negative. In recent years, different algorithms have been proposed for this task, using, for example, regression, trust propagation and matrix factorization. These approaches have tried to solve the problem of link label prediction using ideas from social theories, and most of them predict a single missing label given that the labels of the other edges are known. However, in most real-world social graphs, the number of labeled edges is usually smaller than the number of unlabeled edges. Predicting a single edge label at a time would therefore require multiple runs and is more computationally demanding. In this thesis, we look at the link label prediction problem on a signed citation network with missing edge labels. Our citation network consists of papers from three major machine learning and data mining conferences together with their references, with edges showing the relationships between them. An edge in our network is labeled positive (dataset relevant) if the reference is based on the dataset used in the paper, and negative otherwise. We present three approaches to predict the missing labels. The first approach converts the label prediction problem into a standard classification problem: we generate a set of features for each edge and then adopt Support Vector Machines to solve the classification problem. For the second approach, we formalize the graph such that the edges are represented as nodes with links showing the similarities between them. We then adopt a label propagation method

to propagate the labels from known nodes to those with unknown labels. In the third approach, we adopt a PageRank approach in which we rank the nodes according to the number of incoming positive and negative edges and then set a threshold. Based on the ranks, we can infer that an edge will be positive if it points to a node above the threshold. Experimental results on our citation network corroborate the efficacy of these approaches. With each edge having a label, we also performed additional network analysis: we extracted a subnetwork of the dataset relevant edges and nodes in our citation network, and then detected different communities in this extracted subnetwork. To understand the detected communities, we performed a case study on several dataset communities. The study shows a relationship between the major topic areas in a dataset community and the data sources in the community.

ACKNOWLEDGEMENTS

I would like to thank God for His grace, which has led me this far. I would like to acknowledge my parents, Chief and Dr. Njoku, whose support and prayers have kept me going. I would also like to acknowledge my supervisor, Prof. Xiangliang Zhang, whose constant advice and corrections have kept me on the right path. I also acknowledge King Abdullah University of Science and Technology for granting me this study opportunity. Finally, I would like to thank the Computer Science department professors who taught me one course or another, bestowing upon me knowledge without which I would not have been able to succeed.

TABLE OF CONTENTS

Examination Committee Approval
Copyright
Acknowledgements
List of Abbreviations
List of Symbols
List of Figures
List of Tables
List of Algorithms
1 Introduction
  1.1 Signed Network
  1.2 Link Label Prediction
  1.3 Problem Setting
2 Prior Work
  2.1 Structural Balance Theory
  2.2 Social Status Theory
  2.3 Page Rank
  2.4 Label Propagation
  2.5 Link Prediction Using Propagation
3 Link Label Prediction Methodology
  3.1 Problem Definition
  3.2 Supervised Method
    3.2.1 Features and Methodology
    3.2.2 Support Vector Machines (SVM)
  3.3 Label Propagation Method
    3.3.1 Label Propagation Theory
  3.4 PageRank-based Method
    3.4.1 PageRank Theory
4 Experimental Results
  4.1 Data Collection
  4.2 Dataset Description
  4.3 Results
  4.4 Community Network Analysis
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References

LIST OF ABBREVIATIONS

AUC      Area Under Curve
ICDM     International Conference on Data Mining, IEEE
KDD      ACM SIGKDD Conference on Knowledge Discovery & Data Mining
MF-LiSP  Matrix Factorization for Link Sign Prediction
RBF      Radial Basis Function
ROC      Receiver Operating Characteristic
RPR      Rooted PageRank
SDM      SIAM International Conference on Data Mining
SNSs     Social Network Sites
SVM      Support Vector Machines
T-RPR    Time-sensitive Rooted-PageRank
UCI      University of California, Irvine
UCR      University of California, Riverside

LIST OF SYMBOLS

α (alpha)   indicates a variable; a parameter for SVM
∈           indicates membership in a range or set
µ (mu)      indicates a variable or parameter
φ           mapping function
Φ           mapping function
π           PageRank vector
ℝ           space of data points
Σ (Sigma)   indicates a summation
σ (sigma)   indicates a variable or parameter
ζ (zeta)    indicates slack variables

LIST OF FIGURES

1.1 Relationship between papers in a signed citation network. The local vicinity of each edge from paper C comprises an edge from paper D.
2.1 Structural balance: each triangle must have 1 or 3 positive edges to be balanced. Figures 2.1 (a) and (c) are balanced, while Figures 2.1 (b) and (d) are unbalanced.
2.2 16 triads determined by three nodes and a fixed unsigned edge (from A to B).
3.1 Example of a link label prediction problem with one unknown label.
3.2 Example of a linear separating hyperplane of a separable problem in a 2D space. Support vectors (circled) delimit the margin of maximal separation between the two classes.
3.3 A non-linearly separable problem in a 2D space mapped into a new feature space where the data are linearly separable.
3.4 A sample graph G. (a) Original graph G given as input to the algorithm, (b) resulting graph G* produced as output of the algorithm.
3.5 PageRank scores. (a) Resulting rank scores of the normal PageRank algorithm, (b) resulting rank scores of our method; the red line shows the threshold.
3.6 A directed graph representing a small web of six web pages.
4.1 The number of collected papers for each of the three conferences from 2001 to 2014.
4.2 Distribution of the number of citations.
4.3 Full citation graph.
4.4 Performance results of the evaluated approaches.
4.5 ROC curves of the different approaches.
4.6 Results of the evaluation on unseen papers.
4.7 Topic cloud of the four largest communities.
4.8 Topic cloud of the two selected communities.
4.9 The largest community. (a) UCI dataset repository - C.L. Blake and C.J. Merz, (b) UCI KDD archive - S.D. Bay, (c) WebACE: A Web Agent for Document Categorization and Exploration - Han et al.
4.10 The second largest community. (a) UCI dataset repository - A. Asuncion and D. Newman.
4.11 The third largest community. (a) RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al., (b) NewsWeeder: Learning to Filter Netnews - Lang et al., (c) Gradient-based Learning Applied to Document Recognition - Lecun et al., (d) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring - Golub et al.
4.12 The fourth community. (a) GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al., (b) Cascading Behavior in Large Blog Graphs - Leskovec et al., (c) GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al.
4.13 The fifth community. (a) The UCR Time Series Data Mining Archive - Keogh and Folias, (b) Emergence of Scaling in Random Networks - Barabasi et al., (c) The UCR Time Series Classification/Clustering Homepage - Keogh et al., (d) Exact Discovery of Time Series Motifs - Abdullah et al., (e) Keogh.

LIST OF TABLES

4.1 Directed Network Information
4.2 Dataset Network Information

LIST OF ALGORITHMS

1 Graph Conversion Algorithm
2 Label Propagation, Zhu et al.
3 Label Spreading, Zhou et al.

Chapter 1

Introduction

The emergence and rapid growth of Social Network Sites (SNSs) such as Twitter, LinkedIn, eBay and Epinions have greatly increased the influence of the internet in everyday life. People now depend more on the web for information, social interactions, decision making, event planning, sales, etc. Due to this rapidly increasing amount of interaction, social network analysis has attracted much attention in recent years. A considerable amount of this analysis has focused on mining and analyzing interesting user behaviors with the aim of enhancing user experience. Based on a similar representation, our study is on citation graphs. Citation graphs are used to visualize and analyze papers or researchers and can be utilized in several areas, such as analyzing and visualizing the performance of researchers and the relationships between different papers, ranking conferences, papers and datasets, and finding and monitoring communities and the prevalent topics within them [1] [2] [3]. Throughout this thesis, we use the terms edge and link interchangeably. We also use the terms sign and label interchangeably.

1.1 Signed Network

Online social networks have traditionally been represented as graphs, with positively weighted edges representing interactions amongst entities and nodes representing the entities themselves. However, this representation is inadequate, since there are also negative effects at work in most network settings. For instance, in online social networks like Slashdot and Epinions, users often tag others as friends or foes and give ratings to other users or items [4]; on Wikipedia, users can vote against or in favour of the nomination of other users to be admins [5]. In a binary signed network, edges are given positive or negative labels showing the relationship between two nodes. However, the complexity and nature of many graph problems change once negatively labeled edges are introduced. For example, the shortest-path problem in the presence of cycles with negative edges is known to be NP-hard [6]. Several studies have been conducted on binary signed networks, with different algorithms proposed using methods such as structural balance theory, social status theory, matrix factorization, trust and distrust propagation, and regression [4] [7] [8] [5]. In our directed citation network, the direction of an edge is from a paper to its reference (see Figure 1.1). We define its sign to be negative or positive based on its dataset relevancy (i.e., whether the citation is based on the dataset used in the paper). The underlying question is then: does the pattern of edge signs in the local vicinity of a given edge affect the sign of the edge? The local vicinity of an edge comprises the other edges having the same target node as that edge. Knowing the answer to this question would provide insight into the design of computing models in which we try to infer the unobserved link relationship between two papers using the negative and positive relationships observed in the vicinity of this link.

Figure 1.1: Relationship between papers in a signed citation network. The local vicinity of each edge from paper C comprises an edge from paper D.

1.2 Link Label Prediction

Considering a directed graph, the label of a link can be defined to be positive (representing a positive demeanor of its originator towards the receiver) or negative (representing a negative demeanor of its originator towards the receiver). In a citation network, the sign can denote similarities between a pair of nodes based on properties such as datasets, affiliation, research interest, etc. Studies on social networks based on social psychology [9] [10] [11] [12] [4] have shown that future unknown interactions and perceptions of an entity towards another entity tend to be affected by its current interactions or perceptions towards other entities. For instance, if user A is known not to trust user B, and user B is known to trust user C, it is likely that user A will not trust user C either. Considering online shops such as eBay and Amazon, for instance, if everyone who purchased item A also purchased item B, one can infer that if user X buys item A, then user X is also likely to buy item B. Therefore, understanding the latent tension between the positive and negative forces is very important for solving network problems such as recommending friends to users in a social network, propagating trust and distrust, recommending datasets/papers in a citation network, prediction, etc.

The aim here is to examine the relationship between negative and positive links in a network. For this, we need to know the theories of signed networks, which enable us to reason about and understand how different configurations of positive and negative edges provide information for explaining the various interactions in the network [4]. The two most popular theories used for positive and negative relationship prediction on social networks are social status theory [12] [13] and structural balance theory [10] [14]. Structural balance theory was introduced in the mid-20th century in social psychology. This theory considers the possible ways a triad (a triangle on three entities) can be signed. The main idea of this theory is that triads with one or three positive signs (two friends with one enemy, or three mutual friends) are balanced and thus more plausible than triads with none or two positive signs (three mutual enemies, or two enemies with a common friend), which are unbalanced. Social status theory, introduced in [12], takes into account the direction and sign of edges and posits that a negative directed link implies that the initiator views the recipient as having a lower status, while a positive directed link implies that the initiator views the recipient as having a higher status. Thus, an entity will connect to another entity with a positive sign only if it views it as having a higher status. Note that although these theories work well and have proved useful, neither of them can explain the situation where the two nodes connected by an observed edge have no mutual neighbor [15].

Generally, the edge sign prediction problem in most of these existing studies [11] [4] [12] [7] [8] can be formalized as follows: given a network with signs on all its edges except the edge from node u to node v, whose sign is denoted s(u, v), the aim is to predict the missing sign s(u, v) [4]. Our goal in this thesis, however, is to solve the link label prediction problem on our citation network. The link label prediction problem is defined as follows: given information about the signs of certain links in a social network, which reveals the nature of the relationships that exist among the nodes, we want to predict the sign, positive or negative, of the remaining links [16]. Agrawal et al. [16] introduced the link label prediction problem and proposed a matrix factorization based technique, MF-LiSP (Matrix Factorization for Link Sign Prediction). The proposed matrix factorization method relies on prior knowledge of the network (i.e., whether it is a balanced network, a semi-balanced network, etc.) and uses this information to complete the matrix in a way that preserves the structural balance of the network.

1.3 Problem Setting

Our problem setting is described as follows. Given a citation graph G = (V, E), the set of nodes V is a set of papers and data sources, and the set of edges E is the set of relationships between pairs of nodes. Let L_{u,v} denote the type of relationship between a pair of nodes. We focus on two types of relationships: dataset relevant (positive) and non-dataset relevant (negative) relationships. A link from node u to node v is said to be dataset related if the paper or data source represented by node v is cited by the paper represented by node u based on the datasets used or available in v, and vice-versa. Given the information about the relationship nature of certain links in a citation network, we want to learn the nature of the relationships that exist among the nodes by predicting the sign, positive or negative, of the remaining links.

To solve the problem of link label prediction on our signed citation network, we present three approaches, based respectively on SVM, the PageRank algorithm, and the label propagation algorithm. To study the link label prediction problem, a step-by-step research approach was followed. First, we mined the dataset used here from the DBLP computer science bibliography website [17]. We selected three conference datasets, International Conference on Data Mining IEEE (ICDM), SIAM International Conference on Data Mining (SDM), and ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), ranging from 2001 to 2014. From the mined dataset, we extracted the dataset relevant references and constructed a citation graph with directed edges from papers to the papers or data sources they reference. If a reference is dataset relevant, we assign a positive label to the edge; otherwise, we assign a negative label. Then the following tasks were accomplished as part of this research:

- Convert the link label prediction problem to a standard binary classification problem by generating sets of features for each edge and adopting SVM to solve the classification problem.

- Convert the link label prediction problem to a label propagation problem by formalizing the graph in such a way that the edges are transformed to labeled and unlabeled nodes in the converted graph. In the new graph, the edges show the similarities between the nodes; this similarity is based on the nodes the edges are linked to in the original graph. We then adopt a label propagation approach, propagating labels from known nodes to nodes with unknown labels.

- Adopt a PageRank approach to the link label prediction problem by ranking the nodes by their positive and negative incoming edges, such that incoming positive edges add to the rank of a node while incoming negative edges reduce it. We then find a threshold that separates dataset relevant nodes from non-dataset relevant nodes.

- Evaluate the proposed algorithms for the link label prediction problem on our citation network.

- Extract a subgraph comprising the dataset relevant nodes and edges.

- Discover and study the communities in the extracted subnetwork.

This thesis is structured as follows. In chapter 2, we discuss related work on the algorithms and similar ideas used in this thesis; we also discuss previous works on link label prediction that predict links using ideas from label propagation. In chapter 3, we discuss the methodologies used in this thesis and how we approached the problem, including a detailed discussion of how the algorithms used in each approach work. In chapter 4, we discuss the results obtained from the evaluation of these algorithms and show them side by side for comparison; we also discuss the network analysis on the dataset relevant subgraph. Finally, we conclude and propose future research directions.

Chapter 2

Prior Work

A number of papers have investigated the positive and negative relationships between members of signed networks using different approaches. Guha et al. [12] extended already existing works based on trust propagation [18] [19] by introducing distrust propagation. The trust and distrust propagations are obtained by calculating a linear combination of powers of a combined matrix. This combined matrix is made up of the co-citation, trust coupling, direct propagation and transpose trust matrices. This propagation method, however, cannot be explained by structural balance theory. Based on the social status and structural balance theories, Leskovec et al. [7] extended the work of Guha et al. [12] using a machine-learning framework. They evaluated which of a range of structural features gives more information for the prediction task. In their work, they define two classes of features for the machine learning approach. The first class is based on the signed degrees of the nodes and the second class is based on social structure principles using triads. Assume we are trying to predict the sign of the edge from u to v. For the first class of features, they construct seven degree features based on the outgoing edges from u, the incoming edges to v, and the embeddedness of the edge.

These seven features are: d⁺_in(v) and d⁻_in(v), which denote the number of incoming positive and negative edges to v; C(u, v), which denotes the total number of common neighbours of u and v in an undirected sense (the embeddedness of the edge); d⁺_out(u) and d⁻_out(u), which denote the outgoing positive and negative edges from u; the total out-degree of u; and the total in-degree of v. For the second class, they considered each triad involving the edge (u, v) and consisting of a node w (see Figure 2.2), and encoded the information in a 16-dimensional vector specifying the number of triads of each type that (u, v) is involved in. Then, using a logistic regression classifier, they combined the information from the two classes of features into an edge sign prediction. Chiang et al. [8] extended the work of Leskovec et al. [7] by considering not just triads, but longer cycles. In their work, they ignored the edge directions so as to reduce the computational complexity, and they extended the supervised learning approach of [7] by using features derived from longer cycles [8]. However, Agrawal et al. [16] were the first to tackle simultaneous label prediction of multiple links. They formulated the link sign prediction problem as a matrix completion problem in a setting where the data is represented as a partially observed matrix [16]. In their work, they proposed a new technique, Matrix Factorization for Link Sign Prediction (MF-LiSP), that is based on matrix factorizations. Note that many approaches based upon social status theory or structural balance theory have been developed to perform edge sign prediction in signed networks; however, they generally do not perform well in networks with few topological features (i.e., long-range cycles and triads) [15]. The most popular theories in the study of signed social networks are those of social status and structural balance. However, there are some methods used in the link prediction problem that can be extended to solve the link label prediction problem. Prior works based on the approaches used in this thesis are described in the sections below.

Figure 2.1: Structural balance: each triangle must have 1 or 3 positive edges to be balanced. Figures 2.1 (a) and (c) are balanced, while Figures 2.1 (b) and (d) are unbalanced.

2.1 Structural Balance Theory

The formulation of structural balance theory is based on social psychology [10]. Based on this theory, Cartwright and Harary [11] formally provided and proved the notion of structural balance using undirected triads and mathematical theories of graphs. Their notion of structural balance has four basic intuitive explanations, each representing a particular structure. They further claim that these four structures can be divided into two groups, namely balanced and unbalanced (see Figure 2.1). The intuitive explanations for the balanced structures are: A, B and C are all friends (Figure 2.1(a)); B and C are friends with A as a mutual enemy (Figure 2.1(c)). The intuitive explanations for the unbalanced structures are: A is friends with B and C, but B and C are enemies (Figure 2.1(b)); A, B and C are mutual enemies (Figure 2.1(d)). In other words, the balanced triads follow the rules: the friend of my friend is my friend (Figure 2.1(a)); and the enemy of my enemy is my friend, the friend of my enemy is my enemy, and the enemy of my friend is my enemy (Figure 2.1(c)). Structural balance theory was initially intended for undirected networks but has been applied to directed networks by disregarding the edge direction [4].
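The balance rule above reduces to a simple parity check. The following sketch (added for illustration, not part of the original thesis) enumerates all eight sign configurations of a triad and checks balance via the product of the three signs:

```python
# A minimal sketch: structural balance says a triad is balanced iff it has an
# odd number of positive edges, i.e. the product of its three signs is +1.
from itertools import product

def is_balanced(s_ab: int, s_bc: int, s_ca: int) -> bool:
    """Each sign is +1 (friend) or -1 (enemy)."""
    return s_ab * s_bc * s_ca == 1

# Enumerate all eight sign configurations: four balanced, four unbalanced,
# matching the four structures of Figure 2.1 up to symmetry.
for signs in product([1, -1], repeat=3):
    print(signs, "balanced" if is_balanced(*signs) else "unbalanced")
```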

Figure 2.2: 16 triads determined by three nodes and a fixed unsigned edge (from A to B).

2.2 Social Status Theory

Social status theory was motivated by the work of Guha et al. [12] while considering the edge sign prediction problem. Leskovec et al. [4] developed the social status theory using structural balance theory and ideas from Guha et al. [12] to explain signed networks. This theory considers the direction and sign of an edge. Consider a link from A to B. If the link from A to B has a positive sign, it indicates that A views B as having a higher status; conversely, if the link has a negative sign, it indicates that A views B as having a lower status. In this theory of status, A will positively connect to B only if it views B as having a higher status, and negatively otherwise. These relative levels of status can be propagated across the network [12]. Assuming all nodes in the network agree on the social status ordering and the sign of the link from A to B is unknown, we can infer the sign from the context provided by the rest of the network.

2.3 Page Rank

The PageRank algorithm, introduced by Brin and Page [20], measures the importance of a node in a network based on the importance of the nodes that link to it. Some works [21] [22] on link prediction adapted what is known as the rooted PageRank approach to their studies. In the rooted PageRank approach, the random walk assumption of the original PageRank is altered as follows: the similarity score between two vertices u and v can be measured as the stationary probability of v in a random walk that returns to u with probability 1 − α in each step, moving to a random neighbor with probability α [23].

However, this metric is asymmetric; to make it symmetric, it is combined with the counterpart where the roles of u and v are reversed [23]. Given a diagonal degree matrix D with D[i, i] = Σ_j A[i, j],

RPR = (1 − α)(I − αN)^(−1)  (2.1)

where N = D^(−1)A is the adjacency matrix with its row sums normalized to 1. Nowell et al. [21] assigned a connection weight score(u, v) to pairs of nodes (u, v) based on an input graph G. From this score, they produced a ranked list in decreasing order of score(u, v). This can be seen as computing a measure of similarity or proximity between pairs of nodes (u, v) based on the network topology. The random reset procedure from web page ranking was adopted. However, this work focused on the link prediction problem, which is different from our problem setting.
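For a small graph, Eq. (2.1) can be computed directly with dense linear algebra. The following sketch is my own illustration, not thesis code, under the reading above where α is the probability of moving to a random neighbor:

```python
# A minimal sketch of the rooted-PageRank closed form of Eq. (2.1):
# RPR = (1 - alpha) * (I - alpha * N)^(-1), with N = D^(-1) A.
# Row u of the result holds the RPR scores of all nodes with respect to root u.
import numpy as np

def rooted_pagerank(A: np.ndarray, alpha: float = 0.85) -> np.ndarray:
    deg = np.maximum(A.sum(axis=1), 1e-12)   # guard against zero out-degree
    N = A / deg[:, None]                     # N = D^(-1) A, rows sum to 1
    I = np.eye(A.shape[0])
    return (1 - alpha) * np.linalg.inv(I - alpha * N)

# Toy 3-node example: node 0 linked to nodes 1 and 2 and back
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
print(rooted_pagerank(A))
```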

Jabar et al. [22] proposed a new approach, Time-sensitive Rooted-PageRank (T-RPR), which is built upon Rooted-PageRank. This method captures the interaction dynamics of the underlying network structure. They define the problem as follows. Given a time-interaction graph G_k = (V_k, E_k, W), where V_k are the nodes, T_k is the time slot in which there was an interaction between a set of nodes (in their case, actors), and W is the edge weight vector, let score_{x,k}(v_i) denote the RPR score of node v_i when x is used as root on G_k. The resulting list of nodes is sorted in descending order with respect to their RPR scores at time slot k:

L(x)_k = [v_1, v_2, ..., v_n],  (2.2)

where n = |V_k| − 1, V_k \ {x} = {v_1, v_2, ..., v_n}, and score_{x,k}(v_i) ≥ score_{x,k}(v_{i+1}) for i = 1, ..., n − 1. The rankings obtained over the time slots are then aggregated for each root node x and all remaining nodes v_i ∈ V, resulting in an aggregate score aggscore_x(v_i). Then, just like in [21], the aggregate scores are sorted in descending order. Finally, from the list of nodes L(x), the hierarchy graph G_H is inferred. Their work was focused on detecting hierarchical ties between a group of members in a social network, which is different from the objective of this thesis. Though our method uses a similar idea of scoring and ranking nodes, there is no prior work on using PageRank to infer the labels of links in a signed network. We will use PageRank to rank the nodes and, based on their ranks, infer the label of an edge directed towards a node.

2.4 Label propagation

Label propagation, introduced in [24], is a semi-supervised algorithm. Given a graph with labeled and unlabeled nodes, label propagation diffuses the labels recursively across graph edges from the labeled to the unlabeled nodes until a stable result is reached. This results in a final labeling of all the graph nodes. Because of its efficiency and performance, it has been applied to several problems, with modified algorithms being proposed [25] [24] [26] [27]. Several works [28] [29] applied label propagation to the community detection problem. Other works [30] [31] [32] extended the work on community detection and proposed algorithms based on label propagation for finding overlapping communities, where each vertex can belong to up to v communities, v being a parameter of the algorithm. Boldi et al. [33] applied label propagation to clustering and compressing very large web graphs and social networks. They proposed a scalable parameter-free method (Layered Label Propagation) that keeps nodes with the same labels close to each other at each iteration and yields a compression-friendly ordering.

On a similar note, Ugander et al. [34] proposed an efficient algorithm (Balanced Label Propagation) for partitioning massive graphs while greedily maximizing edge locality. There are also several works [35] [36] [37] [38] on classification and prediction in social networks using label propagation. However, to the best of my knowledge, no previous method has been proposed for solving link label prediction on a signed graph using label propagation. We present a method that formalizes a graph such that we propagate labels from labeled to unlabeled edges.

2.5 Link Prediction Using Propagation

A similar idea of propagating labels on links was used in the algorithm proposed by Kashima et al. [39], who used ideas from the label propagation method to predict links. Using ancillary information such as node similarities, they devised a node-information based link prediction method, using a conjugate gradient method and vec-trick techniques to make the computations efficient. Given two sets of nodes X = {x_1, x_2, ..., x_M} and Y = {y_1, y_2, ..., y_N}, note that M = |X| and N = |Y|. The link propagation inference principle, obtained by modifying the label propagation principle, states that two similar node pairs are likely to have the same link strength. The link strengths indicate how likely a link is to exist for a given node pair (x_i, y_j), and are set to some positive value if a link exists in the observed graph, some negative value if no link exists in the observed graph, or zero otherwise. Applying the label propagation principle, they defined an objective function to minimize:

J(F) = (λ/2) vec(F)^T L vec(F) + (1/2) ‖vec(F* ⊙ G) − vec(F ⊙ G)‖²₂  (2.3)

with G_{i,j} = 1 if (i, j) ∈ E, and G_{i,j} = √µ otherwise,

where F is a second-order tensor representing link strength (vec(F) is the vectorization operation of the matrix F) and F* represents the observed parts of the network. The first term of Eq. (2.3) indicates that two link strength values F_{i,j} and F_{l,m} for two pairs of edges (x_i, y_j) and (x_l, y_m) should be close to each other if the similarity between them is high. The second term is the loss function; it fits the predictions to their target values on the observed regions of the network. The second term also acts as a regularization term, preventing the predictions from being too far from zero and improving numerical stability (⊙ is the Hadamard product). The λ > 0 and µ > 0 are regularization parameters which balance the two terms of Eq. (2.3). L is an MN × MN Laplacian matrix. To obtain the F that minimizes Eq. (2.3), Eq. (2.3) is differentiated with respect to vec(F) [39]. This method can fill in missing parts of tensors and is thus applicable to multi-relational domains, making it possible to handle multiple types of links simultaneously [40]. Although the prediction qualities of this method were better than those of other state-of-the-art methods, its efficiency and effectiveness were overshadowed by its computational time and space constraints, thereby limiting its application. To address this issue, Raymond et al. [41] extended [39], proposing a fast and scalable algorithm for link propagation. This algorithm utilizes matrix factorization and approximation techniques (such as Cholesky and eigen-decomposition) to reduce the computational time and space required to solve the linear equations in the Link Propagation algorithm. Note, however, that this solves the link prediction problem, which is different from our problem.

Chapter 3

Link Label Prediction Methodology

In this chapter, we discuss the approaches used in solving our link label prediction problem.

3.1 Problem Definition

The link label prediction problem is defined as follows. Given a citation graph G = (V, E), the set of nodes V is the set of papers and data sources, and the set of directed edges E is the set of relationships between pairs of nodes; an edge pointing from u to v indicates that v is cited in u. We let L_{u,v} denote the label of the edge (u, v) from node u to node v, representing the relationship between u and v. We focus on two types of relationship: dataset relevant (positive) and non-dataset relevant (negative). A link from node u to node v is said to be dataset relevant if node v is cited by node u based on the dataset it uses, and vice-versa. In this thesis, we will address the node relationships using their graphical terms (labels or signs). Given the information about the relationship nature of certain links in a citation network, we want to learn the nature of the relationships that exist among the nodes by predicting the sign, positive or negative, of the remaining links.

We assume that a link (u, v) from a node u to a node v is most likely to be positive if node v is known to have a high volume of positive incoming links; on the other hand, if node v is known to have a high volume of negative incoming links, then link (u, v) is most likely to be negative. This problem can thus be formulated as follows: given a citation graph G = (V, E) and a set of labels L_{u,v} ∈ {−1, 1} for some edges, we predict the missing labels of the other edges in the graph. We next introduce three different approaches for solving this problem.

3.2 Supervised Method

Here, we consider the link label prediction problem as a classification problem. For each edge, we generate a set of features based on the network topology, structural balance theory and status theory. Having obtained these features, we convert the link label prediction problem to a binary classification problem: labeled edges are used to train a classifier that predicts a positive or negative relationship for unlabeled edges. We adopt Support Vector Machines (SVM) to solve this binary classification problem due to their well-known efficacy in classification. A broad description of how SVM works can be found in Section 3.2.2.

3.2.1 Features and methodology

A critical part of any machine-learning algorithm is choosing an appropriate feature set; a poor feature set selection might negatively affect the results learned by many machine-learning algorithms. We divided the features into two categories. The first category is based on the signed degrees of the nodes and the second category is based on the signed degrees of common neighbors.

For an edge (u, v), pointing from node u to node v, we define the following features in the first category:

1. d⁺_in(v) and d⁻_in(v), which denote the number of incoming positive and negative edges to v.

2. d⁺_out(u) and d⁻_out(u), which denote the outgoing positive and negative edges from u.

3. The total degree of u and the total degree of v, which are d(u) and d(v) respectively.

4. C(u, v), which denotes the embeddedness of the edge. We define the embeddedness as the total number of mutual neighbors of u and v in a semi-directed sense, that is, the number of nodes w such that w is linked by an edge in either direction with u and linked to node v in a directed sense (i.e., from w to v).

5. H(u, v), which infers the sign of the edge based on social status theory. Node status is calculated as follows: σ(x) = d⁺_in(x) + d⁺_out(x) − d⁻_in(x).

6. T(v), which is the number of dataset related words contained in the paper title or data source name represented by the target node v.

The status heuristic σ(x) gives node x status benefits for each positive link it receives and each positive link it generates, and a status reduction for each negative link it receives. We then predict a positive sign +1 for (u, v) if σ(u) < σ(v), a negative sign −1 if σ(u) > σ(v), and 0 otherwise. The list of dataset related words was obtained by observing the publications and data sources. For the second category, we consider each node w such that w has an edge to or from u and also an edge to v. Here, we use the idea that if nodes similar to node u link to node v with a positive sign, then u is most likely to link to v with a positive sign, or with a negative sign if nodes similar to node u link to v with a negative sign.

The features for an edge (u, v) in the second category are:

1. N⁺ - the number of mutual neighbors with a positive link to v.

2. N⁻ - the number of mutual neighbors with a negative link to v.

3. P(e_{(u,v)} = 1 | N^±), which denotes the probability of the edge (u, v) being positive based on their mutual neighbors.

To further explain the feature setup, a small network with one unknown label is shown in Figure 3.1. In this small network, the features for the edge (u, v) in the first category are:

1. d⁺_in(v) = 1, d⁻_in(v) = 4

2. d⁺_out(u) = 0, d⁻_out(u) = 1

3. d(u) = 3 and d(v) = 6

4. C(u, v) = 2

5. H(u, v) = −1, since σ(v) = −3 < σ(u) = −1

6. T(v) is obtained from the paper title or the name of the data source represented by v.

The features for the edge (u, v) in the second category are:

1. N⁺ = 0

2. N⁻ = 2

3. P(e_{(u,v)} = 1 | N^±) = 0

Figure 3.1: Example of a link label prediction problem with one unknown label.

Having obtained the features, we pass them to an SVM classifier to classify the edges into positive and negative edges using the RBF kernel (see Section 3.2.2). We used the LibSVM implementation of SVM for the classification. Since the network is made up of many lowly cited nodes (see Figure 4.2), we sample the test dataset (edges) from nodes with high in-degree, to avoid a situation where randomly sampling all the edges linking (in an undirected sense) a node to any other node in the network leaves us with dangling nodes. Another reason for this is to keep a considerable number of links to each node for better predictions. For instance, in Figure 3.1, having only the edge (d, v) in the training dataset might lead to the prediction of positive labels on the other incoming edges to v in the test dataset, because the positively signed edge (d, v) would be the only incoming edge to node v in the training dataset. Thus, for each node with a high in-degree, we randomly sample 10% of the incoming links. The parameters were selected by running a grid-search cross-validation on the training data and then selecting the parameters that achieved the best result. The evaluations and results will be shown and discussed in Chapter 4.
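The pipeline can be sketched roughly as follows. This is my own illustration, not the thesis code: it assumes a networkx DiGraph whose edges carry a hypothetical "sign" attribute, computes only the first-category degree features (the status and title features are omitted for brevity), and trains an RBF-kernel SVM via scikit-learn's SVC, which wraps LIBSVM:

```python
# A minimal sketch of the supervised approach, assuming each edge of a
# networkx DiGraph carries a "sign" attribute in {+1, -1}.
import networkx as nx
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def edge_features(G: nx.DiGraph, H: nx.Graph, u, v):
    d_in_pos = sum(1 for *_, s in G.in_edges(v, data="sign") if s == 1)
    d_in_neg = sum(1 for *_, s in G.in_edges(v, data="sign") if s == -1)
    d_out_pos = sum(1 for *_, s in G.out_edges(u, data="sign") if s == 1)
    d_out_neg = sum(1 for *_, s in G.out_edges(u, data="sign") if s == -1)
    c_uv = len(set(H[u]) & set(H[v]))        # embeddedness (undirected sense)
    return [d_in_pos, d_in_neg, d_out_pos, d_out_neg,
            G.degree(u), G.degree(v), c_uv]

def train_classifier(G: nx.DiGraph, labeled_edges):
    H = G.to_undirected()
    X = [edge_features(G, H, u, v) for u, v in labeled_edges]
    y = [G[u][v]["sign"] for u, v in labeled_edges]
    grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)  # grid-search CV
    return clf.fit(X, y)
```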

3.2.2 Support Vector Machines (SVM)

Support Vector Machines (SVM) were introduced in 1979 (Vapnik, 1982). This algorithm has been shown to support both classification and regression tasks and has thus grown in popularity over the years. The basic concept of the algorithm is as follows: given an input vector, find a hyperplane (boundary) that separates a set of objects having different class memberships. This raises the question of how we find a good hyperplane that not only separates the training data but also generalizes well, since not all hyperplanes that separate the training data will automatically separate the testing data [42]. The optimal hyperplane for this work can be defined as a decision function with maximal margin between positively signed vectors and negatively signed vectors.

Figure 3.2: Example of a linear separating hyperplane of a separable problem in a 2D space. Support vectors (circled) delimit the margin of maximal separation between the two classes.

Optimal Hyperplane

Suppose we have a set of labeled training examples:

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n),  y_i ∈ {1, −1}  (3.1)

In our case, x can be seen as the link features and y as the link label. The training examples are linearly separable if there exists a vector w and a scalar b such that the inequalities

w · x_i + b ≥ 1 if y_i = 1,  (3.2)

w · x_i + b ≤ −1 if y_i = −1,  (3.3)

are valid for all examples of the training set [42]. These inequalities can be combined into one set of inequalities:

y_i(x_i · w + b) − 1 ≥ 0  ∀i  (3.4)

Lying on the hyperplane H_1: w · x_i + b = −1 are the points for which equality (3.3) holds. These points have a perpendicular distance |w · x + b| / ‖w‖ = 1/‖w‖ from the optimal hyperplane with normal w. Similarly, lying on the hyperplane H_2: w · x_i + b = 1 are the points for which equality (3.2) holds; these points likewise have a perpendicular distance 1/‖w‖ from the optimal hyperplane. Therefore, d₋ = d₊ = 1/‖w‖ and the margin is 2/‖w‖. The optimal hyperplane w* · x + b* = 0 can thus be obtained by minimizing ‖w‖² subject to the constraints (3.4). The examples x_i for which y_i(w · x_i + b) = 1 are called the support vectors. If the optimal hyperplane can be built from few of its support vectors relative to the size of the training dataset, the generalization ability will be high, even in an infinite-dimensional space [42].

However, [14] shows that w can be expressed as a linear combination of a subset of the training dataset, namely the points that lie exactly on the margin (i.e., at minimum distance to the hyperplane):

w* = Σ_{i=1}^{l} y_i α*_i x_i,  (3.5)

where α*_i ≥ 0. Since α*_i > 0 only for the support vectors, equation (3.5) represents a concise form of writing w*. To handle non-separable noisy examples between H_1 and H_2, the constraints (3.2) and (3.3) can be relaxed when necessary (i.e., introducing a further cost for doing so) by introducing positive slack variables ζ_i, i = 1, 2, ..., n [42]. The constraints then become:

w · x_i + b ≥ 1 − ζ_i if y_i = 1,  ζ_i ≥ 0 ∀i,  (3.6)

w · x_i + b ≤ −1 + ζ_i if y_i = −1,  ζ_i ≥ 0 ∀i.  (3.7)

One way to assign an extra penalty for errors is to change the objective function to be minimized from ‖w‖²/2 to ‖w‖²/2 + C(Σ_i ζ_i)^k, where a larger C corresponds to assigning a higher penalty to errors.

Non-linearly separable training

A question that may arise is: what if the data are not linearly separable; how can the above method be generalized? Boser et al. [43] showed that this problem can be solved by mapping the data into a new feature space where they can be separated (see Figure 3.3). For this, we need to find a function that performs such a mapping:

φ: ℝ^N → F

However, this feature space might be of a much higher dimension, and mapping the data into a feature space of such dimension might affect the performance of the resulting machine. One can show [44] that during training, the optimization problem only uses the training examples to compute pair-wise dot products x_i · x_j, where x_i, x_j ∈ ℝ^N. This is notable because it turns out that there exist functions that, given two vectors x and y in ℝ^N, implicitly compute the dot product between x and y in a higher-dimensional space ℝ^M without explicitly transforming x and y to ℝ^M. This process uses no extra memory and has a minimal effect on computation time. These functions are called kernel functions. A kernel function can thus be formulated as:

K(x_i, x_j) = (φ(x_i) · φ(x_j))_M,  x_i, x_j ∈ ℝ^N  (3.8)

where (·,·)_M is an inner product of ℝ^M, φ(x) transforms x to ℝ^M (φ: ℝ^N → ℝ^M), and M > N. The most popular kernel functions K(x_i, x_j) = φ(x_i) · φ(x_j) available in most off-the-shelf classifiers are:

- Linear: x_i · x_j
- Polynomial: (γ x_i · x_j + C)^d
- RBF: exp(−γ ‖x_i − x_j‖²)
- Sigmoid: tanh(γ x_i · x_j + C)

The most popular choice of kernel used in Support Vector Machines is the RBF kernel, primarily because of its finite and localized responses across the entire range of the real x-axis [45].
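The kernel trick can be verified numerically on a tiny case. The following check is my own illustration, not from the thesis: for the homogeneous degree-2 polynomial kernel in 2D, the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²) reproduces K(x, y) = (x · y)² without ever forming the higher-dimensional vectors during training:

```python
# A small numerical check that a kernel equals a feature-space dot product.
import numpy as np

def phi(v):
    """Explicit map for the degree-2 homogeneous polynomial kernel in 2D."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)   # K(x, y) = (x . y)^2
print(phi(x) @ phi(y), (x @ y) ** 2)               # both print 16.0
```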

Figure 3.3: A non-linearly separable problem in a 2D space mapped into a new feature space where the data are linearly separable.

3.3 Label Propagation Method

Based on the assumption that if a publication is citing a popular database, the link relationship is most likely to be dataset relevant (positive), a semi-supervised approach to our link label prediction problem is to treat it as a label propagation problem. Note that popular in our case means having most of its incoming links positive. Given a graph G = (V, E) whose nodes V = 1, ..., n represent papers and dataset sources and whose edges E are partially labeled (i.e., only some edges are labeled), we want to infer the labels of the unlabeled edges using the labeled edges and the topological proximity relationships of the graph. A broad description of how label propagation works can be found in Section 3.3.1. In our case, in order to use a label propagation approach, we need a way to convert our directed graph G = (V, E) to a directed graph G* = (V*, E*) such that the edges of G are represented as labeled and unlabeled nodes V*, and the edges E* represent the similarities between them. These similarities are given by a weight matrix W: W_{i,j} is non-zero iff e_i and e_j have the same target node or the target node of one is the source node of the other, i.e., the edges have the configurations ({e_{i,j} and e_{k,j}} or {e_{i,j} and e_{j,k}}). We can then propagate the labels along the directed edges according to the weights of the links in a random walk approach.

To convert our graph G = (V, E) with N edges to G* = (V*, E*) with N nodes, we define a conversion algorithm (see Algorithm 1). This algorithm creates a new graph G* with the edges of G as its nodes, and generates edges such that two nodes in G* are linked only if their edges in G have the same target node or the source node of one is the target node of the other. Two nodes in G* are linked together with a similarity score of two if the edges in G represented by these nodes have the same target node, or a similarity score of one if the source node of one is the target node of the other. Figure 3.4 shows an example of a given graph G converted to graph G*.

Algorithm 1 Graph Conversion Algorithm
Input: Graph G(V, E)
Output: Graph G* = (V*, E*) and weights W*
  V*_i = e_i for i = 1, ..., N
  foreach V*_i do
    Create and add to E* edges connecting V*_i (e_i) to all the incoming and outgoing edges V*_j (e_j) of the target node of edge e_i
    if connecting to an incoming edge then
      Assign a weight W*_{i,j} = 2
    else
      Assign a weight W*_{i,j} = 1
    end
  end

If our graph were a connected graph, using only the graph produced as output by Algorithm 1 would have been enough. However, our graph has more than one connected component. For this, we introduced an adjustment: whenever there is a dead end, we make a random jump to another connected component. This is implemented by introducing a random edge to a node in each connected component CC_i in G from a node in another connected component CC_j. These edges are given no label and are generated based on a probability distribution. The probability distribution is made such that the random edge is more likely to be generated from a node with a high in-degree. This preference is made so that the result of the label propagation algorithm is not affected by the generated edges.
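A sketch of Algorithm 1 follows. This is my own implementation, not the thesis code; the thesis treats G* as directed, but an undirected similarity graph is used here for simplicity, and the cross-component jump edges are omitted:

```python
# A minimal sketch: each edge of the signed citation DiGraph becomes a node of
# G*; two such nodes are connected with weight 2 when their edges share a
# target node, and weight 1 when the target of one is the source of the other.
import networkx as nx

def convert_graph(G: nx.DiGraph) -> nx.Graph:
    G_star = nx.Graph()
    G_star.add_nodes_from(G.edges())              # edge (u, v) -> node of G*
    for (u, v) in G.edges():
        for (a, b) in G.in_edges(v):              # shares target node v
            if (a, b) != (u, v):
                G_star.add_edge((u, v), (a, b), weight=2)
        for (a, b) in G.out_edges(v):             # v is the source of (v, b)
            G_star.add_edge((u, v), (a, b), weight=1)
    return G_star
```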

Figure 3.4: A sample graph G. (a) Original graph G given as input to the algorithm, (b) resulting graph G* produced as output of the algorithm, with weights assigned to its edges.

Based on the label propagation algorithm (see Algorithm 2), starting with labeled nodes 1, 2, ..., l and unlabeled nodes l + 1, ..., n, each node propagates its label to its neighbors in the manner of a Markov chain on the graph, and the process is repeated until convergence [46].

3.3.1 Label propagation Theory

Label propagation is a popular graph-based semi-supervised learning framework. The problem setup is defined as follows. Denote a set of items and their corresponding labels (x_1, y_1), ..., (x_n, y_n); let X = {x_1, ..., x_n} be the set of items and Y = {y_1, ..., y_n} the item labels. Let (x_1, y_1), ..., (x_l, y_l) be the items with known labels Y_L = y_1, ..., y_l, and (x_{l+1}, y_{l+1}), ..., (x_n, y_n) be the items with unknown labels Y_UL = y_{l+1}, ..., y_n. The aim is to predict the labels of the unlabeled items from X and Y_L. To achieve this goal, a graph G = (V, E) is constructed where the set of nodes V = x_1, ..., x_n is the set of items X, and the weights of the edges E show the similarities among the items.

Intuitively, similar nodes should have similar labels. The label of one node is propagated to other nodes based on the weights of the edges (i.e., labels are more likely to be propagated through edges with larger weights). Different algorithms have been proposed for label propagation; these include iterative algorithms [25] [24] and Markov random walks [26].

Iterative Algorithms

Given the graph G, the idea is to propagate labels from the labeled nodes to the unlabeled nodes in the graph. Starting with the labeled nodes (x_1, y_1), ..., (x_l, y_l), labeled with 1 or −1, and the unlabeled nodes (x_{l+1}, y_{l+1}), ..., (x_n, y_n), labeled with 0, each node propagates its label to its neighbors, and the process is repeated until convergence. Algorithms of this kind have been proposed by Zhu et al. and Zhou et al. [25] [24] (see Algorithms 2 and 3). The labels on the labeled and unlabeled data are denoted by Ŷ = (Ŷ_l, Ŷ_u). Zhu et al. proposed and proved the convergence of the propagation algorithm (see Algorithm 2), in which the initial labels of the data points with known labels (x_1, ..., x_l) are forced onto their estimated labels, i.e., Ŷ_l is constrained to be equal to Y_l. A similar label propagation algorithm (see Algorithm 3), known as label spreading, was proposed and proved to converge by Zhou et al. [25]; at each step, a node i gets a contribution from its neighbors j (weighted by the normalized weight of the edge (i, j)) and an additional small contribution given by its initial value [46]. In general, we can expect the convergence rate of these two algorithms to be at worst on the order of O(kn²), where k is the number of neighbors of a point in the graph. In the case of a dense weight matrix, the computational time is thus O(n³) [46].

Algorithm 2 Label Propagation, Zhu et al.
Input: Graph G(V, E), labels Y_l
Output: Labels Ŷ
  Compute D_ii = Σ_j A_ij
  Compute P = D^(−1) A
  Initialize Y^0 = (Y_l, 0), t = 0
  while Y^t hasn't converged do
    Y^(t+1) ← P Y^t
    Y^(t+1)_l ← Y_l
    t ← t + 1
  end
  Ŷ ← Y^t

Algorithm 3 Label Spreading, Zhou et al.
Input: Graph G(V, E), labels Y_l
Output: Labels Ŷ
  Compute D_ii = Σ_j A_ij
  Compute S = D^(−1/2) A D^(−1/2)
  Initialize Y^0 = (Y_l, 0), t = 0
  Choose a parameter α ∈ [0, 1)
  while Y^t hasn't converged do
    Y^(t+1) ← α S Y^t + (1 − α) Y^0
    t ← t + 1
  end
  Ŷ ← Y^t
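A compact rendering of Algorithm 2, added here as my own sketch (not thesis code), assuming a dense weight matrix and labels in {−1, +1} with 0 for unknown:

```python
# A minimal sketch of iterative label propagation with clamping (Algorithm 2):
# propagate through the row-normalized weight matrix, re-imposing the known
# labels after every step.
import numpy as np

def label_propagation(W: np.ndarray, y_l: np.ndarray, n_labeled: int,
                      tol: float = 1e-6, max_iter: int = 1000) -> np.ndarray:
    """W: (n, n) similarity weights; y_l: labels of the first n_labeled nodes."""
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # P = D^(-1) W
    Y = np.zeros(W.shape[0])
    Y[:n_labeled] = y_l
    for _ in range(max_iter):
        Y_next = P @ Y
        Y_next[:n_labeled] = y_l             # clamp the labeled nodes
        if np.abs(Y_next - Y).max() < tol:   # convergence check
            Y = Y_next
            break
        Y = Y_next
    return np.sign(Y)                        # predicted labels for all nodes
```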

Markov Random Walks

Szummer et al. [26] took a different approach to label propagation on a similarity graph by implementing Markov random walks. To estimate the probabilities of class labels, they defined the transition probabilities of Markov random walks on the graph from i to k as:

p_{i,k} = W_{i,k} / Σ_j W_{i,j},  (3.9)

where the weight W_{i,j} is given by a Gaussian kernel for neighbors and 0 for non-neighbors, and W_{i,i} = 1 (but one could also use W_{i,i} = 0) [46]. They assume that the starting point of the Markov random walk is chosen uniformly at random, i.e., P(i) = 1/N. A probability P(y = 1 | i) of being of class 1 is associated with each data point x_i. The probability P_t(y_start = 1 | j) of a given point x_j having started from a point of class y_start = 1, given that we arrived at x_j after t steps of the random walk, is given by:

P_t(y_start = 1 | j) = Σ_{i=1}^{n} P(y = 1 | i) P_{0|t}(i | j),  (3.10)

where P_{0|t}(i | j) is the probability that we started from x_i given that we arrived at x_j after t steps of the random walk (this probability can be computed from the p_{i,k}) [26]. If P_t(y_start = 1 | j) > 0.5, x_j is classified as 1; otherwise, it is classified as −1. They proposed two techniques for estimating the unknown parameters P(y | i): maximum likelihood with expectation-maximization (EM), and maximum margin subject to constraints. However, this algorithm's performance is largely dependent on the length of the random walk t. This parameter can be chosen heuristically (i.e., to the scale of the clusters we are interested in) or by cross-validation (if enough data are available) [46].
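A sketch of the t-step random walk classifier follows. This is my own illustration, not from [26], assuming a uniform start distribution so that P_{0|t}(i | j) follows from Bayes' rule by column-normalizing the t-step transition matrix:

```python
# A minimal sketch of Eqs. (3.9)-(3.10): row-normalize W into transition
# probabilities, take t steps, and classify each point by the posterior
# probability of having started at a class-1 point.
import numpy as np

def random_walk_classify(W: np.ndarray, prior_class1: np.ndarray, t: int = 5):
    """W: (n, n) similarity weights; prior_class1[i] = P(y = 1 | i)."""
    P = W / W.sum(axis=1, keepdims=True)                  # Eq. (3.9)
    Pt = np.linalg.matrix_power(P, t)                     # t-step transitions
    back = Pt / np.maximum(Pt.sum(axis=0, keepdims=True), 1e-12)  # P_0|t(i|j)
    post = prior_class1 @ back                            # Eq. (3.10)
    return np.where(post > 0.5, 1, -1)
```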

3.4 PageRank-based Method

Here we consider another semi-supervised approach for predicting the missing link labels in our dataset. In this method, we approach the link label prediction problem as a task of ranking the nodes in the network. The idea is to design an algorithm that assigns higher scores to nodes with more positive in-edges than to those with more negative in-edges. We rank the nodes using the PageRank algorithm, then find a threshold separating the nodes, such that an edge directed towards a node with a rank score above the threshold is labeled as a positive edge, and an edge directed towards a node below the threshold is labeled as a negative edge. A broad description of how PageRank works can be found in Section 3.4.1. Our approach is as follows. Given a graph G, we construct a Markov model that represents the graph as a sparse square matrix M whose element M_{u,v} is the probability of moving from node u to node v in one time step. For instance, using the graph in Figure 3.6, our M would be the matrix P (see Section 3.4.1). We then compute the adjustments that make our matrix irreducible and stochastic (see Section 3.4.1). The PageRank scores are then calculated using a modified version of the quadratic extrapolation algorithm, which accelerates the convergence of the power method [47]. We modified the algorithm such that:

r_i = Σ_{j ∈ L_i} ± r_j / N_j, taking +r_j / N_j when the edge (P_j, P_i) is positive and −r_j / N_j when it is negative.  (3.11)

Equation (3.11) is set up so that incoming negative links decrease the rank score of a node while incoming positive links increase it. An initial rank score of 1/N was assigned to each node (N is the total number of nodes). Figure 3.5 shows the PageRank score distributions of both the original PageRank algorithm and the modified PageRank algorithm. Another important element in this approach is the threshold: the performance of this approach relies on a good choice of threshold. For our network, we chose the 85th percentile as the threshold.
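A plain power-iteration sketch of Eq. (3.11) follows, with percentile thresholding of the resulting scores. This is my own illustration, not the thesis code; the quadratic extrapolation acceleration and the stochastic/irreducibility adjustments of Section 3.4.1 are left out:

```python
# A minimal sketch of the signed rank update: positive in-links add
# r_j / N_j to a node's score, negative in-links subtract it.
import numpy as np

def signed_pagerank(pos_in, neg_in, out_deg, n, iters=100):
    """pos_in[i] / neg_in[i]: nodes with a positive / negative edge into i;
    out_deg[j]: number of outgoing links of node j."""
    r = np.full(n, 1.0 / n)                  # initial score 1/N per node
    for _ in range(iters):
        new = np.zeros(n)
        for i in range(n):
            new[i] = (sum(r[j] / out_deg[j] for j in pos_in[i])
                      - sum(r[j] / out_deg[j] for j in neg_in[i]))
        r = new
    return r

def label_edges(r, edges, pct=85):
    """Label an edge +1 if its target node ranks above the pct-th percentile."""
    thr = np.percentile(r, pct)
    return {(u, v): (1 if r[v] > thr else -1) for (u, v) in edges}
```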

3.4.1 PageRank Theory

The PageRank algorithm, first introduced by Brin and Page [20], is one of the algorithms used by Google to rank their search engine results. The main idea behind the PageRank algorithm is that the importance of any web page can be determined based on the pages that link to it. It was intuitively explained using the concept of an infinitely dedicated web surfer randomly going from page to page by choosing a random link on one page to get to another. The PageRank of page i is the probability that a random web surfer visits page i. However, this random surfer can end up in a loop (cycling around cliques of interconnected pages) or hit a dead end (a web page with no outgoing links). Therefore, in order to address these problems, an adjustment was made such that, with a certain probability, the random surfer jumps to a random web page. This random walk is known as a Markov process. Based on the principle of PageRank, if we include a hyperlink to web page i on our data mining lab site, this means that we consider page i important and relevant to the topic discussed on our site. If lots of other web pages also link to page i, the logical belief is that page i is important (contains some important news or information). Even if page i has only one backlink, if that backlink comes from a very popular site page j (like huffingtonpost.com), we say that page j asserts that page i is important. With this understanding, we can say that the PageRank algorithm is like a counter of an online ballot, where pages vote for the importance of other pages, and this result is then gathered by PageRank and is reflected in the search results.

Figure 3.5: PageRank scores. (a) Resulting rank scores of the normal PageRank algorithm, (b) resulting rank scores of our method; the red line shows the threshold.

For any given graph, PageRank is well-defined and can be used to capture quantitative correlations between pairs of vertices as well as between pairs of subsets of vertices.

The Algorithm

The original PageRank algorithm described by Brin et al. [20] is given by:

$$
PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)
\qquad (3.12)
$$

where $PR(A)$ is the PageRank of page $A$, $PR(T_i)$ is the PageRank of a page $T_i$ that links to page $A$, $C(T_i)$ is the number of outbound links on page $T_i$, and $d$ is a damping factor that can be set between 0 and 1. The damping factor $d$ is usually set to 0.85. This means that the amount of PageRank a page has to vote is 0.85 times its own value, and this vote is shared equally among all the pages it links to. To get accurate results, this process is performed iteratively until it converges. The PageRank $r_i$ of page $P_i$, $i = 1, 2, \dots, n$, can be recursively defined as:

$$
r_i = \sum_{j \in L_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \dots, n,
\qquad (3.13)
$$

where $N_j$ is the number of outlinks from page $P_j$ and $L_i$ is the set of pages that link to page $P_i$.

Operation on Matrices

Let us consider Figure 3.6, a graphical representation of a small web consisting of six web pages. The graph contains six nodes, each representing a web page, with links showing the relationships between the pages.

Figure 3.6: A directed graph representing a small web of six web pages.

Here, a hyperlink matrix P is introduced: a row-normalized matrix with $P_{i,j} = 1/|P_i|$ if node i has a link to node j, and 0 otherwise, where $|P_i|$ is the number of outlinks of node i. At each iteration, a single PageRank vector $\pi^{T}$ containing all the PageRank values is calculated:

$$
\pi^{(k+1)T} = \pi^{(k)T} P
\qquad (3.14)
$$

where $\pi$ is the PageRank vector, usually initialized with $\pi_i = 1/N$ for a web of N pages. The matrix P of the small web in Figure 3.6 is built by this rule; in particular, the row corresponding to P1 is all zeros, since P1 has no outgoing links. Using just the hyperlink structure to build the Markov matrix is therefore not enough, because of the presence of dangling nodes: nodes with no outgoing edges, like P1 in our example. Thus P is not stochastic. These dangling nodes appear very often on the web.

There are two proposed adjustments to deal with these problems. The first adjustment makes the matrix stochastic. This is formulated mathematically as:

$$
S = P + \mathbf{a}\left(\frac{1}{N}\,\mathbf{e}^{T}\right)
\qquad (3.15)
$$

where

$$
a_i =
\begin{cases}
1 & \text{if } P_i \text{ is a dangling node,} \\
0 & \text{otherwise.}
\end{cases}
$$

With this adjustment, the zero row of P corresponding to the dangling node P1 is replaced by the uniform row $(\frac{1}{6}, \dots, \frac{1}{6})$, so every row of S sums to one. However, we also need to make the matrix irreducible so as to ensure the existence of the stationary vector of the chain (the PageRank vector) [48]. Therefore, a second adjustment is made:

$$
G = \alpha S + (1 - \alpha)E
\qquad (3.16)
$$

where $\alpha$ is a scalar between 0 and 1 and $E = \frac{1}{N}\mathbf{e}\mathbf{e}^{T}$ is the teleportation matrix. Setting $\alpha = 0.85$ can be interpreted as the random surfer following the hyperlink structure of the web 85% of the time and jumping to a random new page 15% of the time.
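To tie Equations 3.14 through 3.16 together, here is a minimal sketch that builds the row-normalized matrix P from an adjacency list, applies both adjustments, and iterates Equation 3.14. The six-node link structure below is an assumed stand-in for Figure 3.6, whose exact edges are not reproduced here.

```python
import numpy as np

def google_matrix(links, n, alpha=0.85):
    """Build G = alpha * S + (1 - alpha) * E from a dict node -> outlinks."""
    P = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            P[i, j] = 1.0 / len(outs)           # row-normalized hyperlink matrix
    dangling = P.sum(axis=1) == 0               # rows of zeros (e.g., P1)
    S = P + np.outer(dangling, np.ones(n) / n)  # Eq. 3.15: stochastic fix
    E = np.ones((n, n)) / n                     # teleportation matrix
    return alpha * S + (1 - alpha) * E          # Eq. 3.16: irreducible fix

def pagerank(G, n_iter=100, tol=1e-10):
    n = G.shape[0]
    pi = np.full(n, 1.0 / n)                    # pi_i = 1/N initialization
    for _ in range(n_iter):
        pi_next = pi @ G                        # Eq. 3.14: pi^(k+1)T = pi^(k)T G
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next
        pi = pi_next
    return pi

# Assumed six-node web: node 0 (P1) is dangling, as in the text.
links = {1: [0, 2], 2: [1, 3], 3: [1, 4, 5], 4: [5], 5: [3]}
scores = pagerank(google_matrix(links, n=6))
```

In practice P is stored as a sparse matrix and the dense Google matrix G is never formed explicitly; the dense version here is only for readability.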

Chapter 4

Experimental Results

In this chapter, we present and discuss our dataset and the results of our experiments, compare the proposed approaches, evaluate their predictive performance on our dataset, and analyze the community structure of dataset-relevant citations.

4.1 Data Collection

The dataset used in this thesis comprises papers presented at three major conferences: the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), and the ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD). The dataset spans from 2001 until 2014 and was crawled from the DBLP computer science repository (http://dblp.uni-trier.de/db/conf/icdm/index.html, http://dblp.uni-trier.de/db/conf/sdm/index.html, and http://dblp.uni-trier.de/db/conf/kdd/index.html). The main information in the collected dataset includes paper title, authors, abstract, and references. Figure 4.1 shows the number of papers collected for each conference from 2001 to 2014. To use this dataset efficiently in our work, we split the papers and their references and present them as separate entries, thereby building an index over the whole dataset. For each paper, we create node pairs $(u, v_i)$, $i = 1, 2, \dots, n$, where $u$ is the paper, $v_i$ is a reference in the paper, and $n$ is the total number of references in the paper.
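As an illustration of this indexing step, the following is a minimal sketch of how the node pairs $(u, v_i)$ could be built; the record layout (a dict with 'title' and 'references' fields) is an assumption, not the actual crawler output format.

```python
def build_edge_list(papers):
    """papers: iterable of dicts like {'title': str, 'references': [str, ...]}.

    Assigns each distinct title a node id and emits one (u, v_i) pair per
    reference, i = 1..n, mirroring the index described in the text.
    """
    node_id = {}
    def nid(title):
        return node_id.setdefault(title, len(node_id))
    edges = []
    for paper in papers:
        u = nid(paper['title'])
        for ref in paper['references']:
            edges.append((u, nid(ref)))  # directed citation edge u -> v_i
    return node_id, edges

papers = [{'title': 'Paper A', 'references': ['Dataset X', 'Paper B']}]
node_id, edges = build_edge_list(papers)  # edges: [(0, 1), (0, 2)]
```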

Figure 4.1: The number of collected papers for each of the three conferences from 2001 to 2014: (a) IEEE International Conference on Data Mining (ICDM); (b) SIAM International Conference on Data Mining (SDM); (c) ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD).

4.2 Dataset Description

The obtained dataset is presented as a citation network, with nodes representing papers and their references and directed links representing the citation relationship. In total, there are 4830 papers from the three conferences, and on average each paper references one dataset source, although some papers have more than one reference to dataset sources and others have none. Each link is labeled as positive or negative based on the nature of the reference.

If the reference is based on the dataset used in the paper, it is labeled as positive; otherwise, it is labeled as negative. The labels were obtained by a manual process of browsing through the papers to check which of their references are dataset related. The network contains 51,680 nodes and 101,503 edges, of which 90% are negative. The full citation network is shown in Figure 4.3. We can see that the network is made up of one large connected component and a number of smaller connected components; in total, there are 63 connected components. Information about the network is summarized in Table 4.1.

Table 4.1: Directed Network Information
Number of nodes: 51,680
Number of edges: 101,503
Connected components: 63
Number of nodes with out-links: 4,830

Figure 4.2: Distribution of the number of citations.
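The statistics in Table 4.1 can be recomputed from the edge list with a few lines of graph code. This sketch uses networkx and counts components of the underlying undirected graph (weakly connected components), which is our assumption about how the components were counted.

```python
import networkx as nx

def network_stats(edges):
    """edges: iterable of (u, v) directed citation pairs."""
    G = nx.DiGraph()
    G.add_edges_from(edges)
    return {
        'nodes': G.number_of_nodes(),
        'edges': G.number_of_edges(),
        # components of the underlying undirected graph
        'connected_components': nx.number_weakly_connected_components(G),
        'nodes_with_out_links': sum(1 for n in G if G.out_degree(n) > 0),
    }
```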

Figure 4.3: Full citation graph.

The number of citations of each node in our network (Figure 4.3) can be seen in Figure 4.2, with the most cited paper in our dataset being Latent Dirichlet Allocation - Blei et al. The nodes in Figure 4.2, represented by their indices, are sorted in descending order according to their citation count in our dataset. We can see in Figure 4.2 that the dataset is mainly made up of lowly-cited papers and dataset sources. The highly-cited dataset sources in our dataset (with positive in-degree greater than 20, based on the different citation formats), listed according to their citation count, are:

1. UCI Repository of Machine Learning Databases - Blake and Merz
2. UCI Repository of Machine Learning Databases - Asuncion and Newman
3. UCI Repository of Machine Learning Databases - Frank and Asuncion
4. The UCI KDD Archive - S. D. Bay
5. GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al.

6. Gradient-based Learning Applied to Document Recognition - LeCun et al.

The UCI Repository of Machine Learning Databases [49] is a machine learning dataset repository maintained by the University of California, Irvine. The repository, created in 1987, currently comprises 348 datasets that encompass a wide variety of application areas and data types from a broad range of problem areas, including medicine, engineering, politics, finance, molecular biology, etc.

The UCI KDD Archive [50], published in 1999, is similar to the UCI Repository of Machine Learning Databases. However, its aim is to store large datasets that can challenge existing algorithms in terms of scaling as a function of the number of individual samples (or examples, or objects). This archive also spans a wide variety of problem tasks and data types.

GroupLens [51] is a research paper published in 1994. It introduced a distributed system, integrated into the Usenet news architecture, for gathering, distributing, and using ratings from some users to predict other users' interest in articles. The news clients can be freely substituted, providing an environment for experimentation in predicting ratings and in user interfaces for collecting ratings and presenting predictions [51].

Gradient-based Learning Applied to Document Recognition is a paper on neural networks and handwriting recognition. In this paper, the authors created a new database called the Modified NIST dataset, or MNIST, which comprises over 60,000 handwritten digit images. They also collected datasets of over 500,000 hand-printed characters (comprising upper-case letters, lower-case letters, digits, and punctuation) spanning the entire printable ASCII set, and over 3000 words.

4.3 Results

The network that we study is overwhelmingly unbalanced, with most of the nodes having very low in-degree. Thus, we evaluate the three different algorithms by randomly sampling 10% of the edges to each high in-degree node as the test set. The evaluation results are obtained by running the evaluation five times with random samples and taking the average. However, since our aim is to make predictions for new papers, we also set aside 300 papers as unseen data to verify the performance of the algorithms. We measure the performance of our methods based on precision, recall, and AUC (area under the ROC curve), and also show their ROC curves.

For the SVM method, we evaluate 5 different approaches with different feature sets:

1. SVM method with all 12 features (noted as Degree & Embedded in Figure 4.4)
2. SVM method with the 9 features from node degree information (noted as Degree in Figure 4.4)
3. SVM method with the 3 features from mutual neighbors (noted as Embedded in Figure 4.4)
4. SVM method with 28 features: our 9 degree features plus the 16 structural balance features presented in [7] (noted as Degree & 16 Triads in Figure 4.4)
5. SVM method with only the 16 structural balance features presented in [7] (noted as 16 Triads in Figure 4.4)

In addition, we have:

1. the Label Propagation method (noted as Propagation in Figure 4.4)
2. the PageRank method (noted as PageRank in Figure 4.4)

The results shown in Figure 4.4 compare the AUC, precision, and recall of these different approaches. We can see that using just triad information or information obtained from mutual neighbors produces worse results since, as noted by [7], the triad features are only relevant when two nodes have neighbors in common, so they are expected to be most effective for edges of greater embeddedness. However, the network we study has very few edges with high embeddedness.

The ROC curves of the 7 different approaches are shown in Figure 4.5. We can see that the methods relying on the embeddedness of the network performed worse than the rest. The propagation method in general performed well, while the PageRank method performed worse than the SVM methods. However, in this thesis work our aim is to obtain a high number of true positives with few false positives; hence, the region of the curves corresponding to low false-positive rates is of primary interest. In Figure 4.5, the SVM method based on the degree and structural balance features performs better than the other methods if a false-positive threshold of 0.1 is selected. The PageRank method performs worse than both the label propagation and SVM methods, in the low false-positive region as well as over the full curves.

To evaluate the performance of our algorithms on the unseen papers, we use all the papers in the training and testing sets for training and then try to predict the dataset-relevant links in the unseen data. Figure 4.6 shows the result of the prediction using the different algorithms. For SVM, we report the best result among the different feature sets. This is interesting, as it shows how the different methods respond to unseen data.
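To make the evaluation protocol concrete, here is a hedged sketch of the SVM runs (five random samples, averaged precision, recall, and AUC) using scikit-learn. Note that the thesis samples 10% of the edges to high in-degree nodes rather than the uniform stratified split used below, and the kernel and parameters are unspecified assumptions; the feature matrix X and label vector y stand in for whichever of the feature sets above is being tested.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate_svm(X, y, n_runs=5, test_size=0.1):
    """Average precision/recall/AUC over random splits, as in Section 4.3."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        clf = SVC(probability=True).fit(X_tr, y_tr)  # kernel/params assumed
        prob = clf.predict_proba(X_te)[:, 1]         # scores for the ROC curve
        pred = clf.predict(X_te)
        scores.append((precision_score(y_te, pred),
                       recall_score(y_te, pred),
                       roc_auc_score(y_te, prob)))
    return np.mean(scores, axis=0)  # (precision, recall, AUC)
```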

Figure 4.4: Performance results of the evaluated approaches: (a) average AUC; (b) average precision; (c) average recall.

Figure 4.5: ROC curves of the different approaches (Embedded, 16 Triads, Degree & 16 Triads, Degree & Embedded, Degree, Propagation, PageRank).

Figure 4.6: Result of the evaluation on unseen papers.

We can see that the label propagation method out-performs the other methods even on the unseen data, as it was able to predict more dataset-relevant links with fewer errors.

4.4 Community Network Analysis

In many networks, communities can be found as groups of vertices, where the connections among vertices within a community are denser than those from one community to another. Various algorithms have been developed in recent years for detecting this structure. In our work, we are interested in analyzing the citation structure of data-reference papers. Analyzing the dataset communities can give us further information on the datasets, such as how the datasets or dataset sources affect the structure of the network, which datasets are more useful for a particular topic, what the prevalent topics within the communities are, and which people are more likely to use a dataset. We used the GLay community detection algorithm by Su et al. [52] implemented in clusterMaker [53], a tool for community detection in Cytoscape.

Taking out all the non-dataset-related links from our network, we obtained a new network with only dataset-relevant links. Information about the new network can be seen in Table 4.2.

Table 4.2: Dataset Network Information
Number of nodes: 5701
Number of edges: 4627
Connected components: 1265

Running the community detection algorithm and selecting communities with more than five nodes, in order to focus on fairly large communities, we obtained a total of 128 communities. The three largest communities are shown in Figures 4.9, 4.10, and 4.11. We can see that the first and second largest communities (Figures 4.9 and 4.10) are largely made up of papers citing the UCI dataset repository. Although these papers might be using similar datasets, the repository is cited differently according to the UCI repository librarians in charge of the repository at the time of use.
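The thesis runs GLay inside Cytoscape's clusterMaker, which is interactive; as a scriptable stand-in, the sketch below applies networkx's greedy modularity method to the dataset-relevant subnetwork and keeps communities with more than five nodes. The choice of algorithm here is our substitution, not the method used in the thesis.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def dataset_communities(edges, labels, min_size=6):
    """edges: (u, v) pairs; labels: dict edge -> 'positive'/'negative'.

    Keeps only the dataset-relevant (positive) links, then detects
    communities. Note: the thesis used GLay in Cytoscape; greedy
    modularity is a stand-in so the pipeline can be scripted end to end.
    """
    G = nx.Graph()
    G.add_edges_from(e for e in edges if labels.get(e) == 'positive')
    comms = greedy_modularity_communities(G)
    return [c for c in comms if len(c) >= min_size]  # more than five nodes
```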

The UCI dataset repository is one of the most popular repositories used by the machine learning community and comprises datasets of different types and varieties.

The third largest community (Figure 4.11) consists of two main dataset sources, which were introduced in two research papers:

RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al. introduced the Reuters Corpus Volume I (RCV1), an archive of over 800,000 manually categorized newswire stories made available by Reuters, Ltd. for research purposes.

NewsWeeder: Learning to Filter Netnews - Lang et al. introduced NewsWeeder, a netnews-filtering system that collects user ratings on online news articles. The collected rating information is then used to learn a model of the users' interests.

In addition to the three largest communities, two other communities, shown in Figures 4.12 and 4.13, are analyzed. The dataset sources in the fourth community are mainly based on the use and behavior of online articles. These datasets were introduced and discussed in the following research papers:

GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al. introduced the GroupLens system, which makes article predictions based on ratings obtained from Usenet news users.

GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al. is a further discussion of the GroupLens system.

Cascading Behavior in Large Blog Graphs - Leskovec et al. extracted 2,422,704 posts from 44,362 blogs from a larger dataset [54] containing 21.3 million posts from 2.5 million blogs from August and September 2005.

The fifth community mainly comprises citations of time-series dataset sources. Although the UCR time series repository is cited in different formats, some of the papers with different citation formats might be using similar datasets. This is due to the citation policy of the repository (similar to that of the UCI dataset repository), which requires the citation to name the repository's librarians at the time of use. The main dataset sources (according to the different citation formats) in this community are:

The UCR Time Series Data Mining Archive - Keogh and Folias

The UCR Time Series Classification/Clustering Homepage - Keogh, Xi, Wei, and Ratanamahatana (also cited as Keogh)

Emergence of Scaling in Random Networks - Barabasi et al., who evaluated their work on the electrical power grid of the western US with 4,941 vertices, a collaboration graph of movie actors with 212,250 vertices, and a world wide web network with 325,729 vertices.

Exact Discovery of Time Series Motifs - Abdullah et al., who built and made publicly available a web page containing all the time series datasets and code used in their work.

To further analyze these communities, we first look into the research topics of the papers in each community. A total of 20,542 topics in the fields of machine learning and data mining are crawled from the Microsoft Academic website [55], because the papers studied in this thesis are from conferences in these two areas. We then calculate the frequency of each topic in the titles of the papers represented by the nodes in a given community.
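A minimal sketch of the topic-frequency computation described above: given the crawled topic list and the paper titles of one community, count how many titles mention each topic. The case-insensitive substring match is an assumption, since the thesis does not specify its matching rule.

```python
from collections import Counter

def topic_frequencies(topics, titles):
    """Count, for each topic, the number of community paper titles
    containing it (case-insensitive substring match, an assumption)."""
    counts = Counter()
    lowered = [t.lower() for t in titles]
    for topic in topics:
        key = topic.lower()
        counts[topic] = sum(key in title for title in lowered)
    return counts.most_common()  # frequent topics first, for the topic cloud

topics = ['clustering', 'support vector machine', 'time series']
titles = ['Scalable Clustering of Signed Networks', 'Time Series Motifs']
print(topic_frequencies(topics, titles))
```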

The popular topics in each community are the frequent ones, with the frequency threshold varying from community to community. Figures 4.7 and 4.8 show the topic clouds of the three largest communities and the two selected communities.

Figure 4.7: Topic clouds of the three largest communities: (a) Community 1; (b) Community 2; (c) Community 3.

The largest community (Community 1) has three main sources of datasets: the UCI dataset repository, the UCI KDD archive, and WebACE, an agent for exploring and categorizing documents on the world wide web. WebACE was introduced in the research paper WebACE: A Web Agent for Document Categorization and Exploration - Han et al. The largest dataset source in this community is the UCI dataset repository. This is expected due to the vast number of datasets in the repository. However, due to the citation policies of both the UCI dataset repository and the UCI KDD archive, the exact type of dataset used in the papers cannot be easily determined by looking at the repositories (or the citations). The topic cloud confirms that the repositories span a wide variety of data types and application areas.

Figure 4.8: Topic clouds of the two selected communities: (a) Community 4; (b) Community 5.

The second largest dataset community (Community 2) has one main dataset source, the UCI dataset repository - A. Asuncion and D. Newman. This community's topic cloud also confirms the wide range of datasets and application areas of the UCI dataset repository.

The third largest dataset community (Community 3) has four main dataset sources, introduced in RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al., NewsWeeder: Learning to Filter Netnews - Lang et al., Gradient-based Learning Applied to Document Recognition - LeCun et al., and Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring - Golub et al., respectively. We can see (based on the descriptions of the dataset sources above) that three of the heavily cited dataset sources in this community are relevant to text (or character) recognition and classification; hence the larger number of papers on text-related topics. However, due to the fourth dataset source, the community also comprises a fair number of papers on gene expression.

The fourth community (Community 4) has two main dataset sources, introduced

and discussed in three research papers: GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al., Cascading Behavior in Large Blog Graphs - Leskovec et al., and GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al. We can see in Figure 4.8 that this community mainly has papers on collaborative filtering. This is because the main dataset source in this community is the GroupLens system, which uses collaborative filtering to predict (and recommend) Usenet news articles based on collected user ratings. The second dataset source, however, is a collection of posts and articles from blogs.

The fifth community (Community 5) has three main dataset sources. The first is the UCR time series data mining archive, a repository of time series datasets. The second, introduced in Exact Discovery of Time Series Motifs - Abdullah et al., is also made up of time series datasets, while the third contains datasets from different networks. Since two of the three main dataset sources in this community consist of time series datasets, the community has a high number of time series papers.

To further analyze the predicted dataset-relevant links in the unseen data, we ran the community-detection process with the correctly predicted links (using the label propagation method) included, and analyzed the distribution of the links across the communities. 68 of the 109 predicted links fell in the 15 largest communities (those with more than 53 papers), with 19 of the 68 links in the largest community.

Figure 4.9: The largest community. (a) UCI dataset repository - C. L. Blake and C. J. Merz; (b) UCI KDD archive - S. D. Bay; (c) WebACE: A Web Agent for Document Categorization and Exploration - Han et al.

Figure 4.10: The second largest community. (a) UCI dataset repository - A. Asuncion and D. Newman.

Figure 4.11: The third largest community. (a) RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al.; (b) NewsWeeder: Learning to Filter Netnews - Lang et al.; (c) Gradient-based Learning Applied to Document Recognition - LeCun et al.; (d) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring - Golub et al.

Figure 4.12: The fourth community. (a) GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al.; (b) Cascading Behavior in Large Blog Graphs - Leskovec et al.; (c) GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al.
