Link Analysis in Weibo

Similar documents
International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Link Prediction for Social Network

Automated Tagging for Online Q&A Forums

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Semi-supervised Learning

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Big Data Analytics CSCI 4030

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19th, 2014, 2:30pm, in class hard-copy please)

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Large-Scale Face Manifold Learning

Extracting Information from Complex Networks

Data Mining and Data Warehousing Classification-Lazy Learners

Ranking Algorithms For Digital Forensic String Search Hits

Voronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS145: INTRODUCTION TO DATA MINING

Content-based Dimensionality Reduction for Recommender Systems

NON-CENTRALIZED DISTINCT L-DIVERSITY

Predicting Messaging Response Time in a Long Distance Relationship

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more.

Random Forest A. Fornaser

Predicting User Ratings Using Status Models on Amazon.com

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Extracting Information from Social Networks

Applications of Machine Learning on Keyword Extraction of Large Datasets

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp

VECTOR SPACE CLASSIFICATION

Developing Focused Crawlers for Genre Specific Search Engines

Web Personalization & Recommender Systems

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Machine Learning in Action

Feature Selection for fmri Classification

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

modern database systems lecture 4 : information retrieval

Distributed Machine Learning" on Spark

A Fast and High Throughput SQL Query System for Big Data

Data mining: concepts and algorithms

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano

Design and Implementation of Music Recommendation System Based on Hadoop

Face Recognition using Eigenfaces SMAI Course Project

Divide & Recombine with Tessera: Analyzing Larger and More Complex Data. tessera.io

Distribution-free Predictive Approaches

Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits

Part 11: Collaborative Filtering. Francesco Ricci

Copyright 2011 please consult the authors

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Clustering in Networks

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

arxiv: v1 [cs.lg] 5 Mar 2013

Chapter 2 Basic Structure of High-Dimensional Spaces

Applying Supervised Learning

Naïve Bayes for text classification

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Nearest neighbor classification DSE 220

Web Personalization & Recommender Systems

Towards a hybrid approach to Netflix Challenge

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CS229 Final Project: Predicting Expected Response Times

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Virtualizing Agilent OpenLAB CDS EZChrom Edition with VMware

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

CPSC 340: Machine Learning and Data Mining

k-nearest Neighbor (knn) Sept Youn-Hee Han

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Lecture 22 : Distributed Systems for ML

Slides for Data Mining by I. H. Witten and E. Frank

Tag-based Social Interest Discovery

A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information

Feature selection. LING 572 Fei Xia

Using Social Networks to Improve Movie Rating Predictions

Using Data Mining to Determine User-Specific Movie Ratings

arxiv: v3 [cs.ni] 3 May 2017

Unsupervised learning in Vision

Automatic Domain Partitioning for Multi-Domain Learning

MSA220 - Statistical Learning for Big Data

Knowledge Discovery and Data Mining 1 (VO) ( )

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

Metric Learning for Large-Scale Image Classification:

Evaluating Classifiers

Mining Web Data. Lijun Zhang

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

A Brief Look at Optimization

Predictive Indexing for Fast Search

CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM

Business Club. Decision Trees

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods

Object and Action Detection from a Single Example

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

6.034 Quiz 2, Spring 2005

Support vector machines

Transcription:

Link Analysis in Weibo

Liwen Sun, AMPLab, EECS, liwen@cs.berkeley.edu
Di Wang, Theory Group, EECS, wangd@eecs.berkeley.edu

Abstract

With the widespread use of social network applications, online user behaviors, ranging from information search to marketing, have changed greatly due to the connected nature of these platforms. In this project, we study the link analysis problem: analyzing existing links and predicting new ones. With these techniques, online platforms can make effective recommendations that connect users more densely, which opens the door to business opportunities and information sharing. We apply machine learning techniques, considering both global methods, such as SVM and logistic regression, and local methods, such as k-nearest neighbor, and compare these tools on both effectiveness and efficiency.

1 Introduction

...in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

Herbert Simon, Designing Organizations for an Information-Rich World, 1971.

Online social networking platforms have become tremendously popular in recent years and keep pushing the boundary of online business. The largest sites, such as Facebook and Twitter, add thousands of enthusiastic new users daily to their hundreds of millions of actively engaged users. Micro-blogging services, in particular, have rapidly gained popularity following the rise of Twitter. Through limited-length messages (at most one hundred and forty characters), micro-blogging sites have demonstrated their power to spread information as a crowd-powered social medium.

[Figure 1: Growth of Weibo Users]

In China, Weibo, which literally means microblog in Chinese, is a micro-blogging service modeled after Twitter. The two major weibo sites, Tencent and Sina, have each grown their user base to over three hundred million in about two years, as shown in Figure 1. Although Weibo is a social platform similar to Twitter, it has unique features. One primary distinction is that a message of one hundred and forty characters in Chinese can be much more informative than its English counterpart. Weibo also has a unique social effect in Chinese society, e.g., the ability to spread forbidden messages like wildfire before those messages get censored.

In this project, we study link analysis problems on Weibo. Social platforms bring online users together into a gigantic social graph, where users and other entities serve as nodes, and their interactions, such as friend and follow relationships, serve as edges. Such link information is an important source for analyzing and predicting user behaviors. In particular, the link recommendation problem aims at helping users connect with others and making the social graph more densely connected.

Motivation. The service host is incentivized to identify such pairs and send recommendations, as the two users might be interested in connecting with each other. From the users' perspective, they would like to find the right information sources and connect with the people they know. Business users and celebrities are eager to be seen and followed by more people, gaining expanded coverage and increased popularity. By making the social network more connected, the service host can maintain user stickiness and generate traffic. With inappropriate recommendations, however, the system would flood users with huge volumes of information and put them at risk of information overload. Reducing this risk is a priority for improving the user experience, and it also presents opportunities for novel data mining solutions. Thus, capturing users' interests and accordingly serving them with potentially interesting items (e.g., news, games, advertisements, products) is a fundamental and crucial feature of Weibo and other social platforms.

To generate effective link recommendations, we analyze the proximity, or similarity, of nodes (users) in the social graph. If a pair of nodes exhibits high network proximity and yet no direct link exists between them, e.g., they are not friends yet, then a recommendation for the pair may be accepted with high probability. We also consider the attributes of nodes, such as the age and gender of users and the keywords from their online posts, as features for similarity analysis. Overall, our focus is on exploiting the hybrid of link- and content-based information and identifying useful features.

Recent advances in data mining and machine learning provide abundant tools for link analysis. Logistic regression, SVM, and k-nearest neighbor are among the most important and have been successfully applied to a wide range of problems [2, 3]. We explore these state-of-the-art machine learning techniques and compare the models on our problem. We observed that in a connected social graph, local methods, such as k-nearest neighbor, can significantly outperform global methods, such as SVM and logistic regression. Moreover, k-nearest neighbor performs best when we set k = 20, which means we should look at a small local neighborhood. Computationally, as expected, k-nearest neighbor is much more expensive than linear methods such as linear SVM and logistic regression.

The remainder of this report is structured as follows. Section 2 describes the Weibo dataset we use. Section 3 presents the problem definition. We discuss our approaches in Section 4. Some computational issues are addressed in Section 5. We show the experimental results in Section 6. Section 7 concludes the paper and outlines some future directions.

2 Dataset

Tencent Weibo is one of the largest micro-blogging websites in China. Since its launch in April 2010, it has become a major platform for building friendships and sharing interests online.
Currently, there are more than two hundred million registered users on Tencent Weibo, generating over forty million messages each day. Recently, Tencent Weibo released a sample of its data, the first known Weibo dataset at this scale. The dataset is summarized in Table 1.

data file        details                         type             size
user profile     age, gender, num of tweets      list of records  2.3M users
user keywords    keywords extracted from tweets  list of records  2.3M users
item attributes  category, tags                  list of records  6.5K items
follows          follow history                  graph            2M nodes, 50M edges
actions          mention, retweet, comment       graph            2M nodes, 10M edges

Table 1: Dataset Description

Note that each item is also a user, and items and users share the same domain of ids; we can thus look up the corresponding user profile information for a given item id. Items differ from other users in that they are to be recommended to users, so additional information, such as category, is provided for them. A large proportion of items are celebrities and special groups. Also note that the item set (6.5K) is only a small fraction of the user set (2M). This gives us opportunities to optimize our learning algorithms, as will be shown in later sections.

Training Data. Apart from the dataset itself, we also have training data available. The training set contains 70M records, each in the following format: (i, u, c), where i is an item, which can be any entity in Weibo such as a user, a celebrity, or a group; u is a user; and c ∈ {−1, +1} is the class label, where +1 indicates that u accepted the recommendation to follow i and −1 indicates otherwise. When the trained classifier is applied to a test pair (u, i), it outputs +1 or −1 as the classification result.
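For concreteness, here is a minimal sketch of how such records might be parsed. The tab-separated layout and the helper name are our assumptions; the report does not specify the on-disk file format.

```python
import io

def parse_training_records(lines):
    """Yield (item_id, user_id, label) triples with label in {-1, +1}.
    The tab-separated layout is an assumption, not taken from the paper."""
    for line in lines:
        item_id, user_id, label = line.rstrip("\n").split("\t")[:3]
        yield item_id, user_id, int(label)

# Demo on two synthetic records: user u1 accepted item i7, u2 declined i3.
demo = io.StringIO("i7\tu1\t1\ni3\tu2\t-1\n")
for i, u, c in parse_training_records(demo):
    print(i, u, "accepted" if c == 1 else "declined")
```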

3 Problem Definition

Given the various kinds of information in the data and an adequate amount of training data, our task is to select three items from the item set and recommend them to each user. This three-item recommendation scheme is actually used in the real Weibo system, as shown in Figure 2.

[Figure 2: Recommender System in Weibo Snapshot]

Objective Function. Given a recommendation in the form of a (user, item) pair, we compute a real-valued score indicating how likely the user is to accept the recommendation and follow the item. For each user, we rank the recommended items by this score and pick the top 3. To measure the effectiveness of these 3-item ranked results, we use average precision. Suppose an ordered list of items is recommended to a user, who may click zero, one, or more of them to follow. The average precision at n for this user is defined as

p_n = \frac{1}{m} \sum_{k=1}^{n} P(k)\,\mathrm{Rel}(k)

where P(k) is the precision at cut-off k in the item list, i.e., the number of clicked items up to position k divided by k (taken as 0 when the k-th item is not followed); Rel(k) indicates whether the user follows the k-th recommended item; m is the number of recommended items the user actually clicks (if the denominator is zero, the score is set to zero); and n = 3, the default number of items recommended to each user in Weibo. The overall score of an algorithm is the average of p_n over all users.
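This metric is straightforward to compute. Below is a small sketch of p_n under the definition above; the names and the set-based representation are ours.

```python
def average_precision_at_n(recommended, clicked, n=3):
    """Average precision at n.

    recommended: ordered list of recommended item ids
    clicked: set of recommended item ids the user actually followed
    """
    if not clicked:           # denominator m is zero -> score is zero
        return 0.0
    hits, score = 0, 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in clicked:   # Rel(k) = 1
            hits += 1
            score += hits / k  # P(k): clicked items up to k, divided by k
    return score / len(clicked)

# Example: the user follows 'a' and 'c' out of a 3-item recommendation list.
print(average_precision_at_n(['a', 'b', 'c'], {'a', 'c'}))  # (1/1 + 2/3) / 2
```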
4 Approaches

In this section, we explore different machine learning algorithms and study how they can be applied to our link prediction problem.

4.1 Similarity Analysis

To predict links, our approach is to identify pairs of nodes that are similar but not currently connected. Thus, we need metrics, or features, that quantify the proximity, or similarity, of two nodes in a graph. We use the following two kinds of similarity.

Attribute-based Similarity. We can extract attribute-based information from the data, such as user profiles and item categories. To quantify the similarity of two nodes, we estimate the similarity of such information; for example, we can derive the cosine similarity of users' keywords, or their age difference. These values are fed into our machine learning algorithms as features.

Link-based Similarity. Another important source of information is the proximity of two nodes estimated solely from the graph structure. For example, we can count the number of common neighbors of two nodes, or compute the cosine similarity of their adjacency vectors.

Our focus in this project is to measure the effectiveness of these two categories of information and study how to combine them effectively.

4.2 Global methods

We first consider the most straightforward global classification methods: SVM and logistic regression. For each observation (i, u, c) in our training set (recall that u is the user, i the item recommended to u, and c indicates whether u accepted the recommendation), we compute a feature vector for the user-item pair containing information gathered from our data related to u and i. The features we use include the gender and age of u, the keyword similarity between u and i, the popularity of i, the number of common neighbors of u and i in the follow graph, and the number of items u follows that are in the same category as i. Treating these feature vectors as points with the corresponding c's as labels, we run linear SVM and logistic regression; both algorithms produce a weight vector after going over the training data. For each pair of a test user and an item, we compute the same set of features and use the trained model to score how likely the user is to follow the item. We then recommend the items that a test user is most likely to follow.
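A condensed sketch of this pipeline, using scikit-learn's liblinear-backed linear models to stand in for the liblinear library the report cites in Section 6. The field names, feature helpers, and synthetic training data are illustrative assumptions, not the project's exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC  # liblinear-backed linear SVM

def keyword_cosine(u_kw, v_kw):
    """Cosine similarity of two keyword-weight dicts (assumed representation)."""
    num = sum(w * v_kw[k] for k, w in u_kw.items() if k in v_kw)
    den = (sum(w * w for w in u_kw.values()) ** 0.5 *
           sum(w * w for w in v_kw.values()) ** 0.5)
    return num / den if den else 0.0

def pair_features(user, item, adj):
    """Feature vector for a (user, item) pair; adj maps a node id to its
    neighbor set in the follow graph. Field names are illustrative."""
    u_nbrs = adj.get(user["id"], set())
    i_nbrs = adj.get(item["id"], set())
    return [
        user["age"],
        user["gender"],
        keyword_cosine(user["keywords"], item["keywords"]),
        len(i_nbrs),            # popularity of the item
        len(u_nbrs & i_nbrs),   # common neighbors of u and i
    ]

# Train on a stand-in feature matrix X with labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 2] + X[:, 4] > 0, 1, -1)
clf = LogisticRegression(solver="liblinear").fit(X, y)  # or LinearSVC().fit(X, y)
scores = clf.decision_function(X)  # rank a user's candidate items by this score
```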

Another method we implemented is based on SVD. We consider the user-item following matrix A and compute the rank-k subspace that best approximates it: formally, we find orthonormal matrices U, V and a diagonal matrix D such that the Frobenius norm of A − UDV^T is minimized. For each test user, we project his/her vector onto this k-dimensional space (with basis V) and use the projected vector as the scores for the items. We consider this a global approach since we use the projection directly; we can also use the low-dimensional projections as inputs to local methods.

4.3 Local methods

In contrast to the global methods, which aim to find patterns in the entire dataset, we also consider local methods that try to pick out signals from the local neighborhoods of users. We experimented with both user-based and item-based local methods. In this setting, we focus on the follow history of a user and do not include other features.

For the user-based k-nearest-neighbor algorithm, we view each user as a vector in the item space. For each test user, we find the k most similar users in the training set in terms of the cosine similarity between their vectors, and predict the most popular items among these nearest neighbors that the test user hasn't followed yet.

For the item-based algorithm, we compute the correlation between each pair of items (x, y). Let X_i, Y_i be the indicator variables of whether user i followed x and y, respectively. Viewing them as two random variables with one paired observation per user, we compute the Pearson correlation between the two items. For a test user, the score of an item is the sum of its correlations with all the items we observed this user following. We then recommend the items with the highest scores that the test user hasn't followed yet.
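Here is a minimal sketch of the user-based variant. Since each user vector is a 0/1 indicator over items, the cosine similarity of users with item sets A and B reduces to |A ∩ B| / sqrt(|A| · |B|); all names here are ours.

```python
import heapq
from collections import Counter
from math import sqrt

def knn_recommend(test_items, training_users, k=20, top_n=3):
    """User-based k-NN; training_users maps user id -> set of followed items."""
    def cos(a, b):
        return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

    # The k most similar training users under cosine similarity.
    neighbors = heapq.nlargest(
        k, training_users.values(), key=lambda items: cos(test_items, items))

    # Most popular items among the neighbors, excluding already-followed ones.
    votes = Counter()
    for items in neighbors:
        votes.update(items - test_items)
    return [item for item, _ in votes.most_common(top_n)]

# Tiny worked example.
train = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}, 3: {"x", "y", "z"}}
print(knn_recommend({"a", "b"}, train, k=2))  # likely ['c', 'd']
```

The item-based variant scores each unseen item analogously, by summing its Pearson correlation with the items the test user already follows.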
5 Computational Efficiency

The problem we deal with has some nontrivial computational issues. The training set has millions of records, and for each record we have to compute a feature vector. Features based on graph structure can be expensive to derive, since the graph has millions of nodes. Even if we only consider the local neighborhood of a node, some nodes have very large degrees, and thus big neighborhoods, due to the power-law degree distribution shown in Figure 3. In this section, we discuss two techniques we use to speed up the learning algorithms.

[Figure 3: Degree distribution of our social graph (log-log plot, degree vs. frequency count)]

5.1 Materialized View

We observe that some primitive values are frequently queried when building feature matrices. Processing these queries from scratch for each training example involves unnecessarily redundant computation. Therefore, we identify such primitive queries, precompute them, and store them as a materialized view so that queries can simply look up the values when needed. We store this information per (user, item) entry. Although we have millions of users, the number of items is only about six thousand, as given in Table 1, so the storage cost remains in a manageable range. In detail, the materialized view stores:

- the number of common neighbors of a user and an item
- the attribute-value (e.g., age, gender) distribution of an item's fans
- the category distribution of the items each user follows

5.2 Parallelization

Although we focus on link-based computations, which are connected and dependent in nature, we can still exploit some embarrassingly parallel tasks to make the best use of our cluster of multi-core nodes.

Precomputation. In the initial phase of materialized view construction, where we compute values on top of the graph, we can parallelize the graph-based computations. The tasks are essentially based on the set of nodes, so we separate these tasks (or nodes) into several partitions, as depicted in Figure 4. For each partition, we replicate the entire graph structure, so that the cut does not affect the link-based computation.

[Figure 4: Node Partition of the Social Graph]

Constructing the Feature Matrix. As mentioned, one of the biggest bottlenecks in our computation is deriving the feature matrix. We can partition the training set and derive the feature matrix for each partition in parallel. Since each row of the matrix corresponds to a training example, we can finally combine these horizontally partitioned sub-matrices to form the overall feature matrix.
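A sketch of this horizontal partitioning with Python's multiprocessing pool. The featurizer is a stand-in (the real one would be the pair featurizer of Section 4.2 backed by the materialized view), and the partition count mirrors the 24 cores mentioned in Section 6.

```python
import numpy as np
from multiprocessing import Pool

def featurize(record):
    """Placeholder: map one (item, user, label) record to its feature row,
    e.g. via the pair_features sketch in Section 4.2."""
    item, user, _ = record
    return [hash(item) % 7, hash(user) % 5]  # stand-in features

def build_submatrix(chunk):
    return np.array([featurize(r) for r in chunk])

def build_feature_matrix(records, n_parts=24):
    step = -(-len(records) // n_parts)  # ceil(len / n_parts) rows per partition
    chunks = [records[i:i + step] for i in range(0, len(records), step)]
    with Pool(len(chunks)) as pool:     # one worker per partition
        parts = pool.map(build_submatrix, chunks)
    return np.vstack(parts)             # recombine the row partitions

if __name__ == "__main__":
    demo = [("item%d" % i, "user%d" % i, 1) for i in range(10)]
    print(build_feature_matrix(demo, n_parts=4).shape)  # (10, 2)
```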

6 Performance Analysis

In this section, we report the results of our experimental study. We conduct our experiments on a large machine in AMPLab, which features two Intel E5645 processors at 2.40GHz, 288GB of RAM, and 16 1TB disks. With its 24 cores, we are able to implement the parallelization discussed in Section 5.2. For the linear classification methods, SVM and logistic regression, we use the liblinear library [1].

6.1 Comparing Different Approaches

Due to time constraints, we conducted our experiments on the subset of users who followed more than 50 items, which yields a set of 12061 users. We split the users into training and test users, and for each test user we hold out 30% of the items he/she follows and try to predict them. We computed the average prediction rate of the algorithms discussed earlier; the results are shown in Figure 5. As a baseline, we use the algorithm that always recommends the most popular items (in terms of number of followers) that a test user hasn't followed so far. For KNN and LSI we only plot the results for the optimal parameters (i.e., number of neighbors and number of dimensions).

[Figure 5: Comparison of Approaches (success rate of Baseline, SVM, Log. Reg., LSI-25, KNN-10, and ItemCo)]

Among the various algorithms, KNN clearly stands out, which suggests that the signal in the local neighborhood is much stronger than the global pattern. Among the global methods, SVM and logistic regression perform on par with each other, and SVD predicts better than both, although SVD also takes much longer to compute. Also, when the number of items is much smaller than the number of users, the item-based correlation algorithm is much more efficient than the other algorithms. It is also worth noticing that SVM and logistic regression consider information beyond the follow history, while KNN, LSI, and item correlation use only the user-item following matrix.

6.2 Study of k-NN Performance

To substantiate our intuition that the local neighborhood is more informative for recommendation, we run KNN with various values of k and plot the results in Figure 6. Notice that when we take k to be the total number of users, the algorithm is essentially equivalent to the baseline.

[Figure 6: k-NN performance (success rate) vs. k]

7 Conclusions and Future Work

We study the problem of link analysis on a Weibo dataset, the first Weibo data ever released at this scale. We apply different machine learning tools to this problem and compare the results. Since the dataset is of nontrivial scale, we also address some of the computational issues we encountered and improve efficiency. We observe that, for recommending items for users to follow, algorithms that utilize information from the local neighborhood perform much better than algorithms that look for global patterns. On the other hand, the global methods are more efficient to compute, and a naive implementation of KNN would be infeasible for large-scale datasets. It will therefore be interesting to speed up KNN to make it feasible for big data, perhaps via clustering and dimensionality reduction.

References

[1] Liblinear. http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
[2] Bishop, C. M. Pattern Recognition and Machine Learning. 2006.
[3] Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. 2008.