VisoLink: A User-Centric Social Relationship Mining

Lisa Fan and Botang Li
Department of Computer Science, University of Regina
Regina, Saskatchewan S4S 0A2, Canada
{fan, li269}@cs.uregina.ca

Abstract. With the popularity of Web 2.0 websites, online social networking has grown rapidly over the last few years. Large-scale social network extraction and analysis have attracted considerable research attention. However, these studies mostly benefit sociologists and researchers in social community studies and are rarely useful to individual users. In this paper, we present a friend ranking system, visolink, which is a personal social network analysis service based on the user's reading and writing interests. To give users a better understanding of their personal networks, a weighted personal social network representation and visualization are proposed. Our system prototype shows a much more user-friendly design for personal networks than the classical node-edge, distance-based network visualization.

Key Words: Web mining, Social network, User centric

1 Introduction

Writing blogs and sharing photos and videos are among the most popular user behaviors on the Web. Over the past two years, Web 2.0 has brought a great deal of user participation to the Internet, especially in the area of social networking. Millions of users contribute content, including text, pictures and videos, to social networking sites. This huge amount of content and the user activity patterns on the Web have become a rich source for social network analysis and Web data mining. Recently, researchers from computer science and sociology have been drawn to the study of computational social networking [2] [4] [5].

As the number of participants in online social networks increases dramatically, a common feature that current social networking sites offer for managing social relationships online is a linear Friend List. The problem with this list is that as the number of contacts increases, users can hardly find the most important friends in it. One solution, proposed by Anthony Dekker, is to define a distance function between network entities based on the frequency of the user's communication with other friends [1]. However, traditional daily communication is hard to capture and record without a dedicated mechanism.

Blog-based social networking sites are content intensive. Most of the content reflects the author's opinions and interests. From a computer science perspective, such content contains much less noise when mining a user's interests. Our research motivation is to employ the latest Web mining techniques to give users a better way to manage their online social relationships. The proposed framework ranks a user's friends based on their online reading and writing interests. In our system prototype, visolink also provides a user-friendly graphical interface for presenting the personal network.

2 Related Work

Social network analysis mainly analyzes the relationships between people or groups of people within social networks. Generally, a social network is represented computationally as an undirected node-edge graph. Most studies in social network analysis use a binary relationship representation. In [1], conceptual distance is considered in social network analysis. The edge distance between two entities in the social network represents the closeness between them. The link value is obtained simply from the frequency of communication between the two entities in daily life. For example, the value is set to 1.0 if communication occurs every day and to 0.6 if it occurs once per week. Clearly, the frequency of daily-life communication is hard to capture without a dedicated mechanism.

Because of the popularity of blogs, measuring interest similarity between bloggers has attracted researchers' attention. [6] proposed an author-topic model to compute the similarity between authors over the topics distributed across the documents they wrote. Most recent research focuses only on this kind of Web content analysis, using content mining techniques, and not on the patterns of users' online activities. Web mining technology opens up the opportunity to mine relationships among users on the Web [7]. The frequency of online communication can simply be read from the server log file.

[2] evaluated the author-topic model and proposed a two-step method that combines probabilistic topic similarity in the first step with a finer content similarity measure in the second step. The second step takes into account the temporal factor of published post entries, since people's interests can change as time passes. This second-step measure shows an improvement from considering the time intervals related to the author's interests. However, all of these methods are based only on the author's writing interests. Many users still surf the Web only as readers rather than writers. How can a user's reading interests be analyzed? Web usage mining offers a possible solution.

Web usage mining techniques are used to analyze a user's behavior on a website [7] [8] [14]. The study in [8] proposes an approach that combines content and usage to measure the similarity of behavior between two visitors. In [10], the authors introduce a model that finds patterns among visitors in order to build an effective recommender system. Nevertheless, those studies classify users only by their behavior and not by their real interests.

3 The Proposed User-centric Personal Network

To start our social network analysis, the proposed personal network is defined as follows. Each actor has his or her own network, which is represented as a weighted graph $G = (V, E, W)$. In this network, the centric user is the root node of the graph. The vertices $V$ represent the friends of the centric user in the social network. The interest of each centric user is reflected by all the related content, including his or her own blog entries and the blog entries of others that he or she has browsed or read. The edges $E$ represent the relationships between different users in the network. $W$ denotes the weights of the relationships: $Rel(i, j) = W_{ij}$, where $Rel(i, j)$ denotes the relationship between user $i$ and user $j$, and $W_{ij}$ indicates the closeness between the two users.

According to our literature review, almost no previous research provides a mechanism for weighting users' social relationships. Our study focuses only on the personal network. First, a personal network is much less complex than the entire network. Second, personal network analysis is designed to be more user-oriented. In addition, our proposed network design allows one relationship to have different values depending on the centric user. In other words, $Rel(i, j) \neq Rel(j, i)$: the importance of a relationship differs for each actor in the network.

4 User Interest Mining

To weight the different relationships of a centric user, two basic principles for interest mining are needed. The first principle is that if two contacts share more similar interests, they should be considered to have a closer relationship. The second principle is that the more time one user spends on, or the more frequently one user visits, another user's website, the more interesting and important that site's owner or content is to the visitor. Based on these two principles, our task becomes one of measuring user interest similarity.

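
The personal network of Section 3 can be pictured with a small data structure. The following is only a minimal sketch, assuming one network object per centric user; the class name `PersonalNetwork`, the friend names other than Anson, and all weight values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalNetwork:
    """Weighted personal network G = (V, E, W) rooted at one centric user."""

    centric_user: str
    # weights[j] stores W_ij = Rel(centric_user, j); the keys are the friends V.
    weights: dict[str, float] = field(default_factory=dict)

    def set_relationship(self, friend: str, weight: float) -> None:
        """Record the closeness of the centric user to one friend."""
        self.weights[friend] = weight

    def ranked_friends(self) -> list[str]:
        """Return friends ordered from closest to least close."""
        return sorted(self.weights, key=self.weights.get, reverse=True)


# Illustrative (made-up) weights. Each user owns a separate network,
# so Rel(anson, bea) and Rel(bea, anson) are stored independently.
anson = PersonalNetwork("anson")
anson.set_relationship("bea", 0.8)
anson.set_relationship("carl", 0.3)

bea = PersonalNetwork("bea")
bea.set_relationship("anson", 0.2)

print(anson.ranked_friends())
```

Keeping one weight table per centric user is what lets the two directions of a relationship take different values.
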
4.1 Writing Content Analysis

Writing content analysis concentrates on mining the centric user's self-generated content. Blog content mining has been studied in several recent works [2] [3] [4] [5]. One of the two main approaches in previous work uses a topic distribution model based on probability theory. The other uses a statistical, term-frequency, content-based approach that is mainly applied in information retrieval.

Each blog entry may contain several topics. The whole text corpus of each user is viewed as a combination of different topics, and each topic occurring in a content corpus has a probability value. With the help of an entropy-based measure such as KL-divergence, the topic probabilities of two writers can be compared. Topic models for learning the interests of authors from a text corpus were introduced in [6] [8], and Rosen-Zvi proposed the Author-Topic model to extend the basic LDA model [6]. Both methods need to learn their parameters through estimation. In our study, the topic probability distributions are obtained directly from the distribution of tags (keywords), since tags are inserted by the authors themselves. Similar to the approach in [6], the similarity measure between users $i$ and $j$ is shown in Equation 1,

$$D(i, j) = \sum_{t=1}^{T} \left[ \theta_{it} \log \frac{\theta_{it}}{\theta_{jt}} + \theta_{jt} \log \frac{\theta_{jt}}{\theta_{it}} \right] \tag{1}$$

where $T$ denotes the number of topics in the topic set and $\theta_{it}$ denotes the probability of topic $t$ for user $i$. This method applies KL-divergence to compute the similarity between users $i$ and $j$.

The term-frequency model is well studied in the area of text document classification. After removal of stop words, spam, and low-frequency terms, the terms that occur more frequently in the text contribute more to the document as a whole. According to [2], temporal factors are considered in the second stage of similarity computation. For example, even if the topics of two pieces of content are very similar, the interest similarity value is still low when the time interval between the two publication dates is large. According to [2], the similarity function is defined in Equation 2, where $entry_k$ denotes a blog entry from the entry set $E_i$ of user $i$, and $|m(k) - m(l)|$ denotes the difference in months between the publication dates of entry $k$ and entry $l$. In Equation 2, $\lambda$ takes the value 1 if the time difference is to be considered; otherwise, it takes 0.

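
Equation 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the prototype's code: the helper names `tag_distribution` and `symmetric_kl`, the smoothing constant, and the sample tags are all hypothetical. The smoothing is added here only because a tag that one user never uses would give a zero probability and make the logarithm undefined; the paper does not say how that case is handled.

```python
import math
from collections import Counter


def tag_distribution(tags, vocabulary, smoothing=1e-6):
    """Turn a user's tag list into a smoothed probability distribution.

    Assumes every tag in `tags` also appears in `vocabulary`.
    The smoothing constant is an assumption made for this sketch.
    """
    counts = Counter(tags)
    total = len(tags) + smoothing * len(vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def symmetric_kl(theta_i, theta_j):
    """Equation 1: sum over topics of the KL terms in both directions."""
    return sum(
        theta_i[t] * math.log(theta_i[t] / theta_j[t])
        + theta_j[t] * math.log(theta_j[t] / theta_i[t])
        for t in theta_i
    )


# Made-up tag lists for three users.
vocabulary = ["travel", "photo", "music"]
alice = tag_distribution(["travel", "travel", "photo"], vocabulary)
bob = tag_distribution(["travel", "photo", "photo"], vocabulary)
carol = tag_distribution(["music", "music", "music"], vocabulary)

print(symmetric_kl(alice, bob), symmetric_kl(alice, carol))
```

Note that $D(i, j)$ is a divergence: it is 0 for identical distributions and grows as interests differ, so a smaller value indicates more similar users.
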
To take the average similarity value over all entry content, the sum of the similarity values is divided by the total numbers of entries of users $i$ and $j$, denoted $n_i$ and $n_j$:

$$Sim(i, j) = \frac{\sum_{k \in E_i} \sum_{l \in E_j} S(entry_k, entry_l)\, e^{-\lambda |m(k) - m(l)|}}{n_i\, n_j} \tag{2}$$

4.2 Reading Interest Analysis

Measuring user interest from blog entry content, however, considers only what the user writes on the Web. Although a large number of Web users contribute content, the majority of Web users are still readers. Detecting the reading interests of users is therefore necessary.

Web log analysis studies the access patterns of users' online activities. In the context of social networking, the browsing history of user $i$ on $j$'s website indicates that user $j$'s content is of interest to user $i$. Therefore, if user $i$ stays on page $p$ longer than a threshold time length $l$, where $p$ is not in $E_i$ (the set of pages of user $i$'s personal website), it can be concluded that user $i$ is interested in the content of page $p$.

In the first stage of Web usage analysis, the raw data is extracted from the Web server log files. Because Web server log files record only IP addresses as client identification and contain no user identities, a problem arises when multiple users log on from the same machine. Fortunately, on social networking websites users log in and carry out their online social life under their own accounts. In our project, the logging history is extracted at the application level, from HTTP sessions. Once a user logs in, the application creates a session for that user. A privacy issue may arise if users do not want their browsing history to be processed. To handle this situation, our proposed framework accounts for the case where a user denies permission to process browsing history. The set of visited pages from the browsing history of user $i$ is denoted $R_i$. $R_i$ may be an empty set if permission to process the history data is denied.

4.3 Our Proposed Framework Combining Reading and Writing Interest

Two sets of pages are defined in our proposed framework. The first is the set of pages containing content generated by the centric user. The second is the set of pages containing content that the centric user has read. Based on these two sets, the system analyzes not only what users write but also what they read. It thereby addresses the problem that some users prefer reading others' content to writing their own blog content, which is a very common phenomenon on the Web.

The main task is to measure the similarity between the centric user $i$ and a friend $j$. Because the privacy issue must be considered, the whole measuring process is divided into five stages as follows. First, the similarity $S_1$ between users $i$ and $j$ based on their writings is computed using Equation 3. The content data in this phase comes from the blog entries of users $i$ and $j$.
The result is multiplied by the weight factor $\beta_0$. Second, since log data from both users $i$ and $j$ is collected, the similarity $S_2$ between the content of $i$'s writing and $j$'s reading can be computed; the result is multiplied by a weight factor $\beta_1$. Third, as in phase two, the similarity $S_3$ between the content of $i$'s reading and $j$'s writing is computed and multiplied by the weight factor $\beta_1$. Fourth, the similarity $S_4$ between the content of $i$'s reading and $j$'s reading is computed and multiplied by a weight factor $\beta_2$. Finally, $S_1$, $S_2$, $S_3$ and $S_4$ are summed and the sum is multiplied by another weight factor $\alpha$, which reflects how often user $i$ visits $j$'s website: the more often $i$ visits $j$'s website, the more important user $j$ is to user $i$.

$$S_1 = Sim(W_i, W_j)\, \beta_0 \tag{3}$$

$$S_2 = Sim(W_i, R_j)\, \beta_1 \tag{4}$$

$$S_3 = Sim(R_i, W_j)\, \beta_1 \tag{5}$$

$$S_4 = Sim(R_i, R_j)\, \beta_2 \tag{6}$$

$$Similarity(i, j) = (S_1 + S_2 + S_3 + S_4)\, \alpha \tag{7}$$

where $Sim()$ is the content similarity measure function from Equation 2, $W_i$ denotes the writing content of user $i$, $R_i$ denotes the reading content of user $i$, and $W_j$ does not belong to $R_i$. If user $i$ denies the application permission to process log data, $S_3$ takes the value 0. Similarly, if user $j$ denies permission, $S_2$ takes 0. The weight factors satisfy $\beta_0 > \beta_1 > \beta_2$, because writing interest reflects personal interest more strongly than reading, which can occur arbitrarily. $\alpha$ is the weight factor that indicates how often user $i$ visits $j$'s website.

In Section 4.1, Equation 1 introduces the content analysis model. By replacing $Sim(i, j)$ in Equation 3 with Equation 1, the similarity value between two users $i$ and $j$ can be obtained. After applying Equation 3 to the relationship between each friend and the centric user, the values of the ranking criteria for the friend list are generated. As a result, the system can rank the friend list by shared interests.

Fig. 1: A screenshot of a user's blog-based personal website in the system prototype visolink

5 System Prototype Implementation

To evaluate our ranking method, a system prototype named visolink is under development. The prototype provides services similar to those of current online social networking sites, such as a blog service, photo sharing and friendship management. Experimental data is collected while users are using the site. For example, topic probabilities are extracted from the tag annotations of the user's blog posts, and the user's reading behavior is extracted from the server Web logs. As shown in Figure 1, a user's personal interests are mainly represented by the written content of his or her blog-based personal website, such as blog posts, photo titles, descriptions and comments on others' websites.

The final goal of the system is to present the ranking of social relationships. Showing the order of the ranking is more important than showing the actual ranking scores, which mean little to users on their own. For this reason, the visolink prototype provides an enhanced view of the friend ranking. As shown in Figure 2, the personal social network of the centric user Anson is generated by an automatic graph drawing algorithm. The main contact, Anson, is placed at the center of the graph. Unlike classical graph drawing, which uses the length of edges to represent the distance between two entities, visolink visualizes the network with a vector-based graphical technique that makes less important nodes smaller and more transparent. Representing the network through clarity and node size makes it much easier for users to judge which nodes are more important than asking them to estimate the distance or edge length between nodes by eye. We designed our visualization component to give users a better understanding of their own personal networks. The most important contacts should be emphasized, and those with low similarity values should be de-emphasized. A pseudo-3D view of the personal network is presented to the end user, as shown in Figure 2.
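
The idea of drawing less important nodes smaller and more transparent can be sketched as a simple mapping from similarity score to node style. The paper does not specify the mapping, so the linear scaling, the function name `node_style`, and the radius and opacity bounds below are all assumptions made for illustration.

```python
def node_style(score, max_score, min_radius=6.0, max_radius=30.0, min_opacity=0.15):
    """Map a friend's similarity score to a (radius, opacity) pair.

    Linear scaling and the default bounds are assumptions of this sketch;
    higher-scoring friends get larger, more opaque nodes.
    """
    ratio = score / max_score if max_score > 0 else 0.0
    radius = min_radius + ratio * (max_radius - min_radius)
    opacity = min_opacity + ratio * (1.0 - min_opacity)
    return radius, opacity


# Made-up similarity scores for three friends of the centric user.
scores = {"bea": 0.9, "carl": 0.45, "dana": 0.1}
top = max(scores.values())
styles = {name: node_style(s, top) for name, s in scores.items()}
print(styles)
```

Because only the relative order matters to the user, scaling against the highest score keeps the closest friend at full size and opacity regardless of the absolute values.
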
visolink includes friend ranking and recommendation for the personal network. In the current phase, we have proposed a framework that generates the ranking automatically. The prototype website has started to collect experimental user data.

Fig. 2: A screenshot of our proposed visualization of the personal network ranking result
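
The ranking framework of Section 4.3 can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the prototype's implementation: the entry-level measure $S$ is replaced by a plain cosine similarity over term counts, the weights (0.5, 0.3, 0.2) are arbitrary values that merely satisfy $\beta_0 > \beta_1 > \beta_2$, and the names `Entry`, `sim`, `similarity`, `rank_friends` and all sample data are hypothetical. An empty reading set stands for a user who denied log processing.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    month: int  # month index of the publication or visit date, i.e. m(k)


def cosine(a, b):
    """Stand-in for S(entry_k, entry_l): cosine similarity of term counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def sim(entries_i, entries_j, lam=1.0):
    """Equation 2: time-decayed entry similarity averaged over all pairs."""
    if not entries_i or not entries_j:
        return 0.0  # empty set, e.g. browsing history was denied
    total = sum(
        cosine(ek.text, el.text) * math.exp(-lam * abs(ek.month - el.month))
        for ek in entries_i
        for el in entries_j
    )
    return total / (len(entries_i) * len(entries_j))


def similarity(w_i, r_i, w_j, r_j, alpha, betas=(0.5, 0.3, 0.2)):
    """Equations 3 to 7, with illustrative weights beta_0 > beta_1 > beta_2."""
    b0, b1, b2 = betas
    s1 = sim(w_i, w_j) * b0  # writing vs. writing
    s2 = sim(w_i, r_j) * b1  # i's writing vs. j's reading
    s3 = sim(r_i, w_j) * b1  # i's reading vs. j's writing
    s4 = sim(r_i, r_j) * b2  # reading vs. reading
    return (s1 + s2 + s3 + s4) * alpha


def rank_friends(centric, friends, alphas):
    """Order friends by Equation 7, highest score first."""
    scores = {
        name: similarity(centric["W"], centric["R"], f["W"], f["R"], alphas[name])
        for name, f in friends.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up data: "W" is writing content, "R" is reading content.
anson = {"W": [Entry("travel photo tips", 10)], "R": [Entry("jazz concert review", 10)]}
friends = {
    "bea": {"W": [Entry("travel photo diary", 10)], "R": []},
    "carl": {"W": [Entry("jazz concert tonight", 11)], "R": [Entry("stock market news", 11)]},
}
alphas = {"bea": 1.0, "carl": 1.0}
print(rank_friends(anson, friends, alphas))
```

In this toy data the shared writing topic outweighs the shared reading topic, both because $\beta_0 > \beta_1$ and because the second match is one month apart and is therefore damped by the decay term.
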

6 Conclusions and Future Work

In this paper, an approach that combines content and usage analysis for mining user interests in online social networks has been proposed. It measures users' interests based on both their writing and their reading. This similarity measure between online users provides fundamental support for personal social network visualization and personalized recommendation. Existing online datasets suitable for our experiments are hard to find, because both blog content and application logging data are needed. In the next phase of the project, we will perform evaluation experiments to examine the accuracy and effect of the ranking method using our own site, visolink.com. A recommendation system based on online social relationship ranking will be explored in the future.

References

1. Dekker, A.: Conceptual Distance in Social Network Analysis. Journal of Social Structure. 6(3) (2005)
2. Shen, D., Sun, J., Yang, Q., Chen, Z.: Latent Friend Mining from Blog Data. In: 6th International Conference on Data Mining, pp. 552-561. Hong Kong, China (2006)
3. Takama, Y., Matsumura, A., Kajinami, T.: Interactive Visualization of News Distribution in Blog Space. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 413-416. IEEE Press, Hong Kong, China (2006)
4. Markrehchi, M., Kamel, M.S.: Learning Social Networks from Web Documents Using Support Vector Classifier. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 88-94. IEEE Press, Hong Kong, China (2006)
5. Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network. In: 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 678-684. Chicago, U.S.A. (2005)
6. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for Authors and Documents. In: 20th Conference on Uncertainty in Artificial Intelligence, pp. 487-494. Arlington, Virginia, U.S.A. (2004)
7. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer (2006)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993-1022 (2003)
9. Murata, T., Saito, K.: Extracting User's Interests from Web Log Data. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 343-346. Hong Kong, China (2006)
10. Mobasher, B., Dai, H., Luo, T., Sun, Y., Zhu, J.: Integrating Web Usage and Content Mining for More Effective Personalization. In: Int'l Conf. on E-Commerce and Web Technologies, ECWeb2000, pp. 165-176. UK (2000)