Link Prediction for Social Network


Ning Lin
Computer Science and Engineering
University of California, San Diego

Abstract

Friendship recommendation has become an important problem for social networks in the digital era. In this paper, I model friendship recommendation as a link prediction problem on a graph. Using supervised learning, I select several useful features and classification models to achieve accurate friendship recommendation. Experimental results show that the approach achieves high accuracy, precision, recall, and F-measure.

Keywords: link prediction, social network, Support Vector Machine, Logistic Regression.

I. INTRODUCTION AND LITERATURE

Social networks have gained tremendous popularity in recent years, and people communicate through them more and more frequently. For anyone who manages a social network, it is therefore important to predict correctly, by program, the relationship or proximity of arbitrary pairs of users, and thus provide a better experience for customers. In this paper, I build a reliable friendship recommendation system.

There are two types of datasets in friendship prediction: homogeneous and heterogeneous networks. Homogeneous networks contain only topological information, whereas heterogeneous networks come with extra side information. For example, we might have features describing the friendship connection, such as how many posts the two users are tagged in together and how frequently they chat with each other. According to [1], there are three challenges in friendship prediction: (1) in a heterogeneous network, it is not obvious how best to combine the topology and the side information; (2) link prediction datasets are extremely imbalanced, that is, the number of edges known to be present is often far smaller than the number of edges known to be absent; and (3) the large scale of social networks leads to computational inefficiency.
There are many methods for link prediction: (1) methods based on node neighbors, (2) methods based on the ensemble of all paths, and (3) higher-level approaches such as low-rank approximation, unseen bigrams, and clustering. In this paper I address link prediction on the dataset of [2], a homogeneous network collected from Facebook with incomplete temporal information. The dataset is often used in link prediction research. State-of-the-art methods for this problem include matrix factorization [1] and tensor factorization [3]. However, scalability is a big concern for both feature-based methods (discussed in this assignment) and kernel-based methods on most real-life social networks. Cold start is also a problem. Most research has moved on to heterogeneous networks, whose additional user information helps address these problems.

The rest of this paper is organized as follows. Section II describes the dataset. Section III describes the selection of features and models in detail. Section IV compares the performance of the different models and provides discussion. Finally, Section V concludes the paper.

II. DATASET

The dataset is a temporal undirected social network containing friendship data of Facebook users. The network has nodes (i.e., users) and edges (i.e., links). On average, every user has friendship connections. The user with the most friends has 1098 connections. Figure 1 shows the degree distribution of the graph.

Fig. 1. Degree Distributions

The temporal information is the UNIX timestamp of link establishment. However, it is incomplete: over half of the edges (481327) lack temporal information, and their timestamps are treated as zero. The rest of the edges were established starting from to . Figure 2 shows the temporal distribution of the graph.
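The degree statistics quoted in this section (average degree, maximum degree, degree distribution) can be recomputed from a plain edge list. The following is a minimal sketch, assuming the edges are already available as (u, v) pairs; the function names and the toy edge list are illustrative and are not taken from the dataset of [2].

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors} from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops, which are not friendships
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def degree_summary(adj):
    """Summarize node count, edge count, average/maximum degree and the degree histogram."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    histogram = defaultdict(int)
    for d in degrees:
        histogram[d] += 1
    return {
        "nodes": len(degrees),
        "edges": sum(degrees) // 2,  # each undirected edge is counted twice
        "average_degree": sum(degrees) / len(degrees),
        "max_degree": max(degrees),
        "histogram": dict(histogram),
    }


# Hypothetical toy graph, for illustration only.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
summary = degree_summary(build_adjacency(toy_edges))
print(summary)
```

Plotting the `histogram` entry on log-log axes is one common way to obtain a degree distribution figure of the kind referred to above.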
To conduct experiments, I sort the edges by timestamp and remove the last 1% of the edges from the graph as the testing set; the remaining edges form the training set. The following table shows an overview of the dataset.
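The temporal split described above can be sketched in a few lines. This is a minimal illustration, assuming each edge is a (u, v, timestamp) tuple with missing timestamps stored as 0, as in Section II; the function names, the rule of keeping at least one test edge, and the toy data are assumptions made for the example. The second helper counts test-set users that never appear in the training set, which is one way to measure the cold start problem.

```python
def temporal_split(edges, test_fraction=0.01):
    """Sort edges by timestamp and hold out the most recent fraction as the test set.

    Edges with a missing timestamp (stored as 0) sort first and stay in training.
    """
    ordered = sorted(edges, key=lambda e: e[2])
    n_test = max(1, int(len(ordered) * test_fraction))  # assumption: keep at least one test edge
    return ordered[:-n_test], ordered[-n_test:]


def unseen_nodes(train, test):
    """Return the nodes that occur in the test edges but in no training edge."""
    seen = {n for u, v, _ in train for n in (u, v)}
    return {n for u, v, _ in test for n in (u, v)} - seen


# Hypothetical toy data: a chain of 100 edges with increasing timestamps.
toy_edges = [(i, i + 1, i) for i in range(100)]
train, test = temporal_split(toy_edges)
new_users = unseen_nodes(train, test)
print(len(train), len(test), new_users)
```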

                    Number of Edges    Number of Nodes
whole data set
training data
testing data

Fig. 2. Temporal Distribution

Since the cold start problem is an important issue for recommendation systems, I count how many users in the testing data are new, that is, unseen in the training data. This measures how serious the cold start problem is. The result shows that 481 users are new to the training data.

III. PREDICTIVE TASK AND MODEL SELECTION

As shown in Figure 3, we are given a partial undirected graph Gp = (V, E) and two nodes u, v ∈ Gp that do not have a link in Gp. The task is to predict whether a link between them exists in the original graph G.

Fig. 3. Model of Link Prediction

This problem can be formulated as a binary classification problem: for any given pair of nodes, we classify whether there should be an edge or not, using attributes obtained from the graph. The goal is to find the best method and features for training the classifier. Limited memory is also a significant issue; I apply edge sampling to address it.

A. Feature Extraction

1) Common Neighbors: Common neighbors may be the most direct and intuitive feature for link prediction. The more common friends two nodes share, the higher the probability that they will be connected in the future. I count the number of neighbors that two nodes x and y have in common and denote it

\[ |\Gamma(x) \cap \Gamma(y)| \tag{1} \]

where Γ(x) is the set of neighbors of x.

2) Jaccard's Coefficient: Building on the idea of common neighbors, I further reason that a link between two nodes x and y is more likely if they share more common neighbors while each having fewer neighbors overall. I therefore use Jaccard's coefficient, also known as the Jaccard index, as an indicator for predicting a link. Originally, the Jaccard index is used to compare the similarity and diversity of two sets A and B, and is defined as the size of their intersection divided by the size of their union:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{2} \]

In [4], it is adapted to link prediction as

\[ J(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|} \tag{3} \]

3) Adamic and Adar: Considering only the number of common neighbors is not enough, since we should not view every link between node x or y and a common neighbor z equally. A common neighbor is more representative and significant if it has fewer friends. Based on this idea, I use Adamic/Adar as a means to weight common neighbors with fewer friends more heavily and to discount common neighbors who are gregarious. Adamic and Adar define the similarity between two nodes as

\[ \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|} \tag{4} \]

where z is a common neighbor of x and y.

4) Preferential Attachment: One is more likely to gain a new friend if he or she already has more friends; that is, "the rich get richer." Similarly, we can assume that a new link is more likely to form if its two endpoints have more neighbors. Therefore, I choose preferential attachment as one of the features and denote it

\[ |\Gamma(x)| \cdot |\Gamma(y)| \tag{5} \]

The basic premise of preferential attachment is that the probability that a new link has node x as an endpoint is proportional to the current number of neighbors of x.

5) Simplified Katz: In addition to common neighbors, the most direct method of link prediction, counting the number and lengths of paths between two nodes may also be helpful. Consequently, I choose Katz as the basis for this feature. Katz defined a measure that directly sums over this collection

of paths, exponentially damped by length to count short paths more heavily. This notion leads to the measure

\[ \sum_{l=1}^{\infty} \beta^{l} \, \left| \mathrm{paths}^{\langle l \rangle}_{x,y} \right| \tag{6} \]

where $\mathrm{paths}^{\langle l \rangle}_{x,y}$ is the set of all length-l paths from x to y, and β > 0 is a parameter of the predictor. However, because the dataset is so large, summing over path lengths l from 1 to infinity is infeasible. I simplify Katz by summing l only from 1 to 3. That is, in addition to common neighbors, where the path length equals 1, I also consider paths of length up to 3. The experimental results show that this is enough for predicting new links. Note that a very small β yields predictions much like common neighbors, since paths of length 3 or more contribute very little to the summation. I empirically set β to .

B. Training Set

1) Positive Training Data Sampling: To handle the large scale of the social network and avoid over-fitting during training, I sample part of the graph by edges as positive training data. I empirically set the number of sampled edges to . There is no need to use more training data, since testing accuracy does not improve with it, and more training data also increases training time.

2) Negative Training Data Sampling: This social graph is relatively sparse. To prevent imbalance in the dataset, I randomly select two different nodes and check whether they are adjacent to each other. If the two nodes are already neighbors, the pair is labeled as a positive edge. If not, the pair is labeled as a negative edge and added to the negative training data set.

C. Testing Set

In this part, I remove the last 1% of the edges in terms of time from the original graph, to model the situation in which people are friends in the real world but not yet in the social network. In addition, I sample as many negative testing examples as there are positive testing examples to emulate a real prediction setting.

D. Baseline Model

The baseline model in this prediction task is Naive Bayesian, a simple probabilistic model. Naive Bayesian applies Bayes' theorem with the assumption that each attribute is independent of the others. Two major advantages of Naive Bayesian are that (1) it requires little time to build the model, and (2) it requires only a small amount of training data to estimate the necessary parameters of the model, such as means and variances.

E. Model

1) Support Vector Machine: The support vector machine is one of the most commonly used classifiers today. It can be applied to a wide variety of data sets and is very reliable. Given a set of n-dimensional training points, each labeled with one of two classes, a support vector machine finds an (n-1)-dimensional hyperplane that separates the two classes with the largest margin. However, sometimes data sets are not linearly separable in the original feature space, and we need to map the data from the original feature space into a higher-dimensional space with the help of a transform function. A set of functions called kernel functions is usually used for this purpose. These functions share the property that dot products in the higher-dimensional space can be computed in the original space, which eases the computational burden. I choose the radial basis function as the kernel function, which empirically gives better results.

2) Logistic Regression: Logistic regression is a classification model derived from a probabilistic perspective. It can perform both binary and multi-class classification by modeling the probability of each class. The probability of each class is modeled with a logistic function whose input is a linear combination of the features. Thus, from a statistical perspective, logistic regression is a generalized linear model. Generally, logistic regression is trained with gradient descent in a supervised fashion. In my link prediction experiment, the outcome being classified is binary (i.e., whether a link exists or not). Therefore, a binary logistic regression model is trained. Because the features I use here are highly correlated, I adopt an L2 regularizer to avoid overfitting.

IV. EXPERIMENTAL RESULTS

A. SVM

Below is the result of SVM with different values of the penalty parameter C of the error term.

        Accuracy
C=
C=
C=
C=
C=

Finally, I select C=1.7 for the model.

B. Logistic Regression

Below is the table for different values of the inverse regularization strength C of the logistic regression model.

        Accuracy
C=
C=
C=
C=
C=

Finally, I select C=0.05 for the model.

C. Comparison between Naive Bayesian, SVM and Logistic Regression

The following table shows the results of the three models using the best parameters.
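The four metrics reported in the results tables follow the usual definitions for binary classification. As a minimal sketch (the function name and the toy labels are illustrative and are not taken from the experiments), they can be computed from true and predicted labels as follows, with label 1 meaning that a link exists.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = link exists)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # links correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-links predicted as links
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # links that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-links correctly rejected
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for a balanced test set of three links and three non-links.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Because the testing set is built with equal numbers of positive and negative pairs, an accuracy of 0.5 corresponds to random guessing.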

             Naive Bayesian    SVM    Logistic Regression
Accuracy
Precision
Recall
F1 Score

As expected, the two proposed models outperform the baseline model. This may be due to the dependency between features, which conflicts with the independence assumption of the Naive Bayesian model. With a non-linear kernel, SVM does a good job of separating the two classes. Logistic Regression performs better than the other two models.

D. Effects of Unseen Nodes

To see to what extent new nodes affect the result, I remove the unseen nodes from the testing set. The following table shows the results of the three models using the best parameters.

             Naive Bayesian    SVM    Logistic Regression
Accuracy
Precision
Recall
F1 Score

We can observe that both the Naive Bayesian and SVM models benefit a lot from removing new nodes on every metric, while the Logistic Regression model loses points. We can further see that there is no significant difference between the performance of the SVM and Logistic Regression models. From these facts, we can infer that (1) the Logistic Regression model is good at predicting links that involve new users, and (2) both the SVM and Logistic Regression models are roughly equally good at predicting links between two known users.

E. Importance of Each Feature

1) Whole Testing Set: To see the importance of every feature, I remove each feature in turn. The results of removing each feature using the Naive Bayesian model are shown in the following table.

CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

The results of removing each feature using the SVM model are shown in the following table.

CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

The results of removing each feature using the Logistic Regression model are shown in the following table.
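For reference, the neighborhood and path features whose removal is studied in this section are those defined in Section III-A. A minimal pure-Python sketch of them follows, assuming the graph is held as a dictionary mapping each node to its set of neighbors. The function names, the toy graph, and the value β = 0.1 are illustrative assumptions, not the author's code or settings. The Katz sketch counts walks (paths that may revisit nodes) with length measured in edges, which is one common reading of the Katz measure; under this convention a common neighbor contributes a walk of length 2.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors} from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def neighbors(adj, x):
    """Return the neighbor set of x, or an empty set for an unseen node."""
    return adj.get(x, set())


def common_neighbors(adj, x, y):
    return len(neighbors(adj, x) & neighbors(adj, y))


def jaccard(adj, x, y):
    union = neighbors(adj, x) | neighbors(adj, y)
    return len(neighbors(adj, x) & neighbors(adj, y)) / len(union) if union else 0.0


def adamic_adar(adj, x, y):
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0;
    # the guard only protects against degenerate inputs such as x == y.
    shared = neighbors(adj, x) & neighbors(adj, y)
    return sum(1.0 / math.log(len(adj[z])) for z in shared if len(adj[z]) > 1)


def preferential_attachment(adj, x, y):
    return len(neighbors(adj, x)) * len(neighbors(adj, y))


def simplified_katz(adj, x, y, beta, max_len=3):
    """Sum beta**l times the number of length-l walks from x to y, for l = 1..max_len."""
    counts = {x: 1}  # number of walks of the current length from x to each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, c in counts.items():
            for nbr in neighbors(adj, node):
                nxt[nbr] += c
        counts = nxt
        score += beta ** length * counts.get(y, 0)
    return score


# Hypothetical toy graph; nodes 1 and 4 are not linked and share neighbors 2 and 3.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
features = [
    common_neighbors(adj, 1, 4),
    jaccard(adj, 1, 4),
    adamic_adar(adj, 1, 4),
    preferential_attachment(adj, 1, 4),
    simplified_katz(adj, 1, 4, beta=0.1),
]
print(features)
```

Removing a feature for the ablation then amounts to dropping the corresponding entry from each feature vector before training.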
CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

In general, for the Naive Bayesian and SVM models, Preferential Attachment is the most important feature, while Simplified Katz is the least important and does not help much. Without Preferential Attachment, the baseline model cannot even perform better than random guessing. For the Logistic Regression model, Adamic and Adar is the most important feature, while Preferential Attachment seems to have a negative effect on prediction.

2) Remove New Nodes from Testing Set: The results of removing each feature using the Naive Bayesian model are shown in the following table.

CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

The results of removing each feature using the SVM model are shown in the following table.

CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

The results of removing each feature using the Logistic Regression model are shown in the following table.

CommonNbor
Jaccard
AdamAdar
preferattach
Shortest Path
Simplified Katz
All features

We can observe that when only seen users are considered, Simplified Katz becomes a useful feature for the SVM model, while Preferential Attachment still does not help the Logistic Regression model. We can summarize this by saying that a feature that is important to one model may not be important to another, and can even be unhelpful.

V. CONCLUSION

In this paper, I model the friendship recommendation problem as a link prediction problem. On the basis of supervised learning methods, I select useful features and models to attain accurate prediction. I choose Common Neighbors, Jaccard's Coefficient, Adamic and Adar, Preferential Attachment, and Simplified Katz as features, and Support Vector Machine and Logistic Regression as models. The performance of the models is evaluated by accuracy, precision, recall, and F-measure, since I am concerned with the correctness of predicting links that truly exist. Generally speaking, Logistic Regression achieves the highest accuracy and F1 score. Therefore, I believe that the model built with Logistic Regression is the best one.

REFERENCES

[1] Menon, Aditya Krishna, and Charles Elkan. Link prediction via matrix factorization. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg,
[2] Viswanath, B., Mislove, A., Cha, M., and Gummadi, K. P. (2009, August). On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (pp ). ACM.
[3] Dunlavy, Daniel M., Tamara G. Kolda, and Evrim Acar. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD) 5.2 (2011): 10.
[4] Liben-Nowell, D., and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7),
