VisoLink: A User-Centric Social Relationship Mining


Lisa Fan and Botang Li
Department of Computer Science, University of Regina
Regina, Saskatchewan S4S 0A2, Canada
{fan, li269}@cs.uregina.ca

Abstract. With the popularity of Web 2.0 websites, online social networking has thrived over the last few years. Much research attention has been drawn to large-scale social network extraction and analysis. However, these studies mostly benefit sociologists and researchers in social community studies and are rarely useful to individual users. In this paper, we present a friend ranking system, visolink, a personal social network analysis service based on the user's reading and writing interests. To give users a better understanding of their personal networks, a weighted representation and visualization of the personal social network are proposed. Our system prototype offers a more user-friendly view of personal networks than the classical node-edge, distance-based network visualization.

Key Words: Web mining, Social network, User centric

1 Introduction

Writing blogs and sharing photos and videos are among the most popular user behaviors on the Web. In the past two years, Web 2.0 has brought a great deal of user participation to the Internet, especially in social networking. Millions of users contribute content, including text, pictures and videos, to social network sites. This large volume of content and user activity patterns has become a rich source for social network analysis and Web data mining. Recently, researchers from computer science and sociology have been drawn to computational social network studies [2] [4] [5].

As the number of participants in online social networks increases dramatically, a common feature that current social networking sites offer for managing relationships online is a linear Friend List.
The problem with this list is that, as the number of contacts grows, users can hardly pick out their most important friends. One solution, proposed by Anthony Dekker, defines a distance function between network entities based on how frequently the user communicates with each friend [1]. However, everyday communication is hard to capture and record without a dedicated mechanism.

Blog-based social networking sites are content intensive. Most of the content reflects the author's opinions and interests. From a computer science perspective, such content contains much less noisy data for mining a user's interests. Our research motivation is to employ recent Web mining techniques to give users a better way to manage their online social relationships. The proposed framework ranks a user's friends based on their online reading and writing interests. In our system prototype, visolink also provides a user-friendly graphical interface to present the personal network.

2 Related Work

Social network analysis mainly analyzes the relationships between people or groups of people within social networks. Generally, a social network is represented computationally as an undirected node-edge graph. Most studies in social network analysis use a binary relationship representation. In [1], conceptual distance is considered in social network analysis. The edge distance between two entities represents their closeness in the network. The link value is obtained simply from how often the two entities communicate in daily life. For example, the value is set to 1.0 if communication occurs every day and 0.6 if it occurs once per week. Clearly, the frequency of daily-life communication is hard to capture without a dedicated mechanism.

Because of the popularity of blogs, measuring interest similarity between bloggers has attracted researchers' attention. [6] proposed an author-topic model to compute the similarity between authors over the topics distributed across the documents they write. Most recent work focuses on this kind of Web content analysis using content mining techniques, not on patterns in users' online activities. Web mining technology opens the opportunity to mine relationships among users on the Web [7]. The number of online communications can be found directly in the server log file.
[2] evaluated the author-topic model and proposed a two-step method that combines probabilistic topic similarity in the first step with a finer content similarity measure in the second. The second step takes into account the temporal factor of published post entries, since people's interests may change over time. It improves the results by considering the time intervals related to an author's interests. However, all of these methods are based only on the author's writing interests. Many users still surf the Web only as readers rather than writers. How can a user's reading interests be analyzed? Web usage mining offers a possible solution.

Web usage mining techniques are used to analyze a user's behavior on a website [7] [8] [14]. The study in [8] proposes an approach that combines content and usage to measure the similarity of behavior between two visitors. In [10], the authors introduce a model that finds patterns among visitors in order to build an effective recommender system. Nevertheless, those studies only classify users by their behavior, not by their real interests.

3 The Proposed User-centric Personal Network

To start our social network analysis, the proposed personal network is defined as follows. Each actor has his or her own network, represented as a weighted graph $G = (V, E, W)$. In this network, the centric user is the root node of the graph. The vertices V represent the friends of the centric user in the social network. The interest of each centric user is reflected by all related content, including his or her own blog entries and the blog entries of others that he or she has browsed or read. The edges E represent the relationships between users in the network. W denotes the relationship weights: $Rel(i, j) = W_{ij}$, where $Rel(i, j)$ denotes the relationship between user i and user j, and $W_{ij}$ indicates the closeness between the two users.

According to our review, almost no previous research provides a mechanism for weighting users' social relationships. As a result, our study focuses only on the personal network. First, a personal network is much less complex than the entire network. Second, personal network analysis is designed to be more user-oriented. Additionally, our network design allows one relationship to have different values depending on the centric user. In other words, $Rel(i, j) \neq Rel(j, i)$: the importance of a relationship differs for each actor in the network.

4 User Interest Mining

To weight the relationships of a centric user, two basic principles of interest mining are needed. The first is that if two users share more similar interests, they should be considered to have a closer relationship. The second is that the more time one user spends on, or the more frequently one visits, another user's website, the more interesting and important that site's owner or content is to the visitor. Based on these two principles, our task becomes one of measuring user interest similarity.
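
The per-user weighted network defined in Section 3 can be pictured with a small data structure. The following is only a sketch under stated assumptions: the class name `PersonalNetwork`, its methods, the user name "bella" and the weight values are hypothetical and do not come from the visolink prototype. It shows why the two directions of a relationship can differ: each centric user stores his or her own weights.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalNetwork:
    """Weighted personal network G = (V, E, W) rooted at one centric user."""

    centric_user: str
    # Maps each friend (a vertex in V) to the weight W_ij of the edge
    # from the centric user i to that friend j.
    weights: dict[str, float] = field(default_factory=dict)

    def set_relationship(self, friend: str, weight: float) -> None:
        """Store the closeness of `friend` as seen by the centric user."""
        self.weights[friend] = weight

    def rel(self, friend: str) -> float:
        """Rel(i, j) as seen from the centric user; 0.0 if no edge exists."""
        return self.weights.get(friend, 0.0)


# Each user owns a separate network, so Rel(i, j) and Rel(j, i) are stored
# independently and need not agree. The numbers are made up for illustration.
anson = PersonalNetwork("anson")
bella = PersonalNetwork("bella")
anson.set_relationship("bella", 0.8)
bella.set_relationship("anson", 0.3)
```

Keeping one weight map per centric user, rather than one shared edge table, mirrors the point that the same relationship may matter more to one side than to the other.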
4.1 Writing Content Analysis

Writing content analysis concentrates on mining the centric user's self-generated content. Blog content mining has been studied in several recent works [2] [3] [4] [5]. One of the two main approaches in previous work uses a topic distribution model based on probability theory. The other uses the statistical, term-frequency, content-based approach common in information retrieval.

Each blog entry may contain several topics. The whole text corpus of a user is viewed as a combination of different topics. Each topic

occurring in a content corpus has a probability value. With an entropy-based measure such as the KL-divergence, the two writers' probability distributions over shared topics can be compared. Topic models for learning the interests of authors from a text corpus were introduced in [6] [8], and Rosen-Zvi proposed the Author-Topic model to extend the basic LDA model [6]. Both methods need to learn their parameters by estimation. In our study, the topic probability distributions are obtained directly from the distribution of tags (keywords), since tags are inserted by the authors themselves. Similar to the approach in [6], the similarity measure between users i and j is shown in Equation 1,

$$D(i, j) = \sum_{t=1}^{T} \left[ \theta_{it} \log \frac{\theta_{it}}{\theta_{jt}} + \theta_{jt} \log \frac{\theta_{jt}}{\theta_{it}} \right], \quad (1)$$

where T denotes the number of topics, and $\theta_{it}$ denotes the probability of topic t for user i. This method applies the KL-divergence in symmetric form to compare users i and j; a smaller $D(i, j)$ indicates more similar interests.

The term-frequency model is well studied in text document classification. After stop-word removal and the removal of spam and low-frequency terms, the terms that occur more frequently in the text contribute more to the whole document. According to [2], the second stage of similarity computation considers temporal factors. For example, even when the topics of two pieces of content are very similar, the interest similarity is still low if the interval between their publication dates is large. Following [2], the similarity function is defined in Equation 2, where $entry_k$ denotes a blog entry from the entry set $E_i$ of user i, and $|m(k) - m(l)|$ denotes the difference in months between the publication dates of entry k and entry l. Additionally, in Equation 2, $\lambda$ takes the value 1 if the time difference is to be considered; otherwise, it takes 0.
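
As a rough illustration of Equation 1, the sketch below derives topic distributions from tag counts and computes the symmetric KL-divergence between two users. The function names, the tag lists and the additive smoothing constant are assumptions made for this example. The paper does not say how tags missing from one user's profile are handled; smoothing is one common way to keep the logarithms defined.

```python
import math
from collections import Counter


def tag_distribution(tags, vocabulary, smoothing=1e-6):
    """Turn a user's tag list into a probability distribution over topics.

    Additive smoothing is an assumption of this sketch: it keeps every
    probability positive so that the logarithms below are defined.
    """
    counts = Counter(tags)
    total = sum(counts[t] + smoothing for t in vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def symmetric_kl(theta_i, theta_j):
    """D(i, j) from Equation 1: KL(i || j) + KL(j || i)."""
    return sum(
        theta_i[t] * math.log(theta_i[t] / theta_j[t])
        + theta_j[t] * math.log(theta_j[t] / theta_i[t])
        for t in theta_i
    )


# Made-up tag lists for three users.
tags_a = ["travel", "photo", "photo", "food"]
tags_b = ["photo", "travel", "food", "photo"]
tags_c = ["code", "code", "python", "food"]
vocabulary = sorted(set(tags_a) | set(tags_b) | set(tags_c))

theta_a = tag_distribution(tags_a, vocabulary)
theta_b = tag_distribution(tags_b, vocabulary)
theta_c = tag_distribution(tags_c, vocabulary)

d_ab = symmetric_kl(theta_a, theta_b)  # same tag counts, so no divergence
d_ac = symmetric_kl(theta_a, theta_c)  # mostly different tags, larger divergence
```

Because D is a divergence, a ranking built on it would place the friends with the smallest values first.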
To average the similarity over all entry content, the sum of similarity values is divided by the total numbers of entries of users i and j, denoted $n_i$ and $n_j$:

$$Sim(i, j) = \frac{\sum_{k \in E_i} \sum_{l \in E_j} S(entry_k, entry_l)\, e^{-\lambda |m(k) - m(l)|}}{n_i n_j} \quad (2)$$

4.2 Reading Interest Analysis

Measuring user interest from blog entry content, however, considers only what the user writes on the Web. Although a large number of Web users contribute content, the majority are still readers. Detecting users' reading interests is therefore necessary. Web log analysis studies the access patterns of a user's online activities. In the context of social networking, the browsing history of user i on j's website indicates that user j's content is of interest to user i. Therefore, if user i stays on page p longer than a threshold time length l, where p is not in $E_i$ and $E_i$ denotes the

pages of user i's personal website, it can be concluded that user i is interested in the content of page p.

In the first stage of Web usage analysis, the raw data is extracted from the Web server log files. Because Web server log files carry no user identities and record only the IP address as client identification, a problem arises when multiple users log on from the same machine. Fortunately, on social networking websites users log in and conduct their online social life under their own accounts. In our project, the logging history is extracted at the application level, from HTTP sessions: once a user logs in, the application creates a session for that user.

A privacy issue may arise if users do not want their browsing history to be processed. To handle this situation, our proposed framework allows for the case in which processing of the browsing history is denied. The set of pages visited by user i, taken from the browsing history, is denoted $R_i$. $R_i$ may be an empty set if processing of the history data is denied.

4.3 Our Proposed Framework Combining Reading and Writing Interest

Two sets of pages are defined in our proposed framework. The first is the set of pages containing content generated by the centric user. The second is the set of pages containing content the centric user has read. Based on these two sets, the system analyzes not only what users write but also what they read. This addresses the fact that some users prefer reading others' content to writing their own blog content, which is very common on the Web.

The main task is to measure the similarity between centric user i and a friend j. Because the privacy issue must be considered, the whole measuring process is divided into five stages as follows:

First, the similarity $S_1$ between users i and j based on their writing is computed using Equation 3. The content data in this stage comes from the blog entries of users i and j.
The result is multiplied by the weight factor $\beta_0$. Second, since users' log data is collected from both i and j, the similarity $S_2$ between the content of i's writing and j's reading can be computed. The result is multiplied by a weight factor $\beta_1$. Third, as in the second stage, the similarity $S_3$ between the content of i's reading and j's writing is computed. The result is multiplied by the weight factor $\beta_1$. Fourth, the similarity $S_4$ between the content of i's reading and j's reading is computed in the same way. The result is multiplied by a weight factor $\beta_2$. Finally, we sum $S_1$, $S_2$, $S_3$ and $S_4$ and multiply the sum by another weight factor $\alpha$, which reflects how often user i visits j's website: the more often i visits j's website, the more important user j is to user i.

$$S_1 = Sim(W_i, W_j)\, \beta_0, \quad (3)$$

$$S_2 = Sim(W_i, R_j)\, \beta_1, \quad (4)$$
$$S_3 = Sim(R_i, W_j)\, \beta_1, \quad (5)$$
$$S_4 = Sim(R_i, R_j)\, \beta_2, \quad (6)$$
$$Similarity(i, j) = (S_1 + S_2 + S_3 + S_4)\, \alpha, \quad (7)$$

where Sim() is the content similarity function from Equation 2, $W_i$ denotes the writing content of user i, $R_i$ denotes the reading content of user i, and $W_j$ does not belong to $R_i$. If user i denies the application permission to process log data, $S_3$ takes the value 0. Similarly, if user j denies it, $S_2$ takes 0. The weight factors satisfy $\beta_0 > \beta_1 > \beta_2$, because writing interest reflects personal interest more strongly than reading, which can occur arbitrarily. $\alpha$ is the weight factor that indicates how often user i visits j's website.

The content analysis model is introduced in Equation 1 of Section 4.1. By replacing Sim(i, j) in Equation 3 with Equation 1, the similarity value between two users i and j can be obtained. After applying Equation 3 to the relationship between each friend and the centric user, the values of the ranking criteria for the friend list are generated. As a result, the system can rank the friend list by shared interests.

Fig. 1: A screenshot of a user's blog-based personal website in the system prototype visolink

5 System Prototype Implementation

To evaluate our ranking method, the system prototype, named visolink, is under development. This prototype provides the same kinds of services

as current online social networking sites, such as a blog service, photo sharing and friendship management. Experimental data is collected while users use the site. For example, topic probabilities are extracted from the tags users attach to their blog posts, and users' reading behavior is extracted from the server Web logs. As shown in Figure 1, a user's personal interests are represented mainly by the written content of his or her blog-based personal website, such as blog posts, photo titles, descriptions, and comments on others' websites.

The final goal of the system is to present the ranking of social relationships. Showing the order of the ranking is more important than showing the actual ranking scores, which mean little on their own. The system prototype visolink therefore provides an enhanced view of the friend ranking. As shown in Figure 2, the personal social network of centric user Anson is generated by an automatic graph drawing algorithm. The main contact, Anson, is placed at the center of the graph. Unlike classical graph drawing, which uses the length of an edge to represent the distance between two entities, visolink visualizes the network with a vector-based graphical technique that makes less important nodes smaller and more transparent. Representing the network through clarity and node size lets users judge which nodes are more important more easily than estimating distances or edge lengths between nodes by eye. We design our visualization component to give users a better understanding of their own personal networks. The most important contacts should be emphasized, and those with low similarity values can be ignored. A pseudo-3D view of the personal network is presented to the end user, as shown in Figure 2.
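
The idea of encoding importance as node size and transparency rather than edge length can be sketched as follows. This is an illustrative assumption, not the drawing code of the prototype: the function `node_styles`, the linear scaling, the default radius and opacity bounds, and the friend names and scores are all hypothetical.

```python
def node_styles(similarities, min_radius=6.0, max_radius=24.0, min_opacity=0.2):
    """Map each friend's similarity score to a rank, radius and opacity.

    Higher-ranked friends get larger, more opaque nodes. The linear scaling
    and the default sizes are assumptions of this sketch.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    scores = [score for _, score in ranked]
    low, high = (min(scores), max(scores)) if scores else (0.0, 0.0)
    spread = high - low
    styles = []
    for rank, (friend, score) in enumerate(ranked, start=1):
        # Normalise to [0, 1]; if all scores are equal, treat them as equally important.
        level = (score - low) / spread if spread > 0 else 1.0
        styles.append({
            "friend": friend,
            "rank": rank,
            "radius": min_radius + level * (max_radius - min_radius),
            "opacity": min_opacity + level * (1.0 - min_opacity),
        })
    return styles


# Made-up similarity scores for the friends of centric user Anson.
styles = node_styles({"bella": 0.9, "carl": 0.5, "dana": 0.1})
```

Only the order and the relative emphasis reach the viewer, which matches the design goal of showing the ranking order instead of raw scores.
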
visolink includes friend ranking and recommendation for the personal network. In the current phase, we have proposed a framework to generate the ranking automatically. The prototype website has started to collect experimental user data.

Fig. 2: A screenshot of our proposed visualization of the personal network ranking result
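
The ranking framework of Section 4.3 can be sketched end to end as below. Several parts are assumptions made only for illustration: the paper does not specify the entry-level measure S, so cosine similarity over word counts stands in for it; the beta values (0.5, 0.3, 0.2) are arbitrary numbers that merely satisfy the required ordering; alpha is passed in as a given number because its computation is not defined in the text; and each entry is represented as a hypothetical (text, month index) pair.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Stand-in for S(entry_k, entry_l): cosine similarity of word counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sim(entries_i, entries_j, lam=1):
    """Equation 2. Each entry is a (text, month_index) pair."""
    if not entries_i or not entries_j:
        return 0.0  # e.g. an empty reading set when log processing is denied
    total = sum(
        cosine(text_k, text_l) * math.exp(-lam * abs(m_k - m_l))
        for text_k, m_k in entries_i
        for text_l, m_l in entries_j
    )
    return total / (len(entries_i) * len(entries_j))


def similarity(w_i, r_i, w_j, r_j, alpha, betas=(0.5, 0.3, 0.2)):
    """Equations 3 to 7 with beta_0 > beta_1 > beta_2 (values assumed)."""
    b0, b1, b2 = betas
    s1 = sim(w_i, w_j) * b0  # i's writing vs j's writing
    s2 = sim(w_i, r_j) * b1  # i's writing vs j's reading
    s3 = sim(r_i, w_j) * b1  # i's reading vs j's writing
    s4 = sim(r_i, r_j) * b2  # i's reading vs j's reading
    return (s1 + s2 + s3 + s4) * alpha


def rank_friends(centric, friends):
    """Return friend names ordered from most to least similar."""
    scores = {
        name: similarity(centric["writes"], centric["reads"], f["writes"], f["reads"], f["alpha"])
        for name, f in friends.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up data: bella has denied log processing, so her reading set is empty.
anson = {"writes": [("hiking photo trip", 3)], "reads": [("mountain hiking guide", 4)]}
friends = {
    "bella": {"writes": [("hiking trip photo", 3)], "reads": [], "alpha": 1.0},
    "carl": {"writes": [("python code review", 3)], "reads": [("hiking guide", 9)], "alpha": 1.0},
}
order = rank_friends(anson, friends)
```

In this sketch an empty reading set makes every term that depends on it evaluate to 0, so a user who denies log processing is still ranked on the remaining terms.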

6 Conclusions and Future Work

In this paper, an approach that combines content and usage analysis for user interest mining in online social networks has been proposed. It measures users' interests based on both their writing and their reading. This similarity measure between online users provides fundamental support for personal social network visualization and personalized recommendation. An existing online dataset on which our system could run experiments is hard to find, because both blog content and application logging data are needed. In the next phase of the project, we will perform evaluation experiments on our own site, visolink.com, to examine the accuracy and effect of the ranking method. A recommendation system based on online social relationship ranking will be explored in the future.

References

1. Dekker, A.: Conceptual Distance in Social Network Analysis. Journal of Social Structure 6(3) (2005)
2. Shen, D., Sun, J., Yang, Q., Chen, Z.: Latent Friend Mining from Blog Data. In: 6th International Conference on Data Mining. Hong Kong, China (2006)
3. Takama, Y., Matsumura, A., Kajinami, T.: Interactive Visualization of News Distribution in Blog Space. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE Press, Hong Kong, China (2006)
4. Makrehchi, M., Kamel, M.S.: Learning Social Networks from Web Documents Using Support Vector Classifier. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Press, Hong Kong, China (2006)
5. Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network. In: 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Chicago, U.S.A. (2005)
6. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for Authors and Documents. In: 20th Conference on Uncertainty in Artificial Intelligence. Arlington, Virginia, U.S.A. (2004)
7. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer (2006)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003)
9. Murata, T., Saito, K.: Extracting User's Interests from Web Log Data. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence. Hong Kong, China (2006)
10. Mobasher, B., Dai, H., Luo, T., Sun, Y., Zhu, J.: Integrating Web Usage and Content Mining for More Effective Personalization. In: Int'l Conf. on E-Commerce and Web Technologies, ECWeb2000. UK (2000)
