Detecting Opinion Spam in Commercial Review Websites


Detecting Opinion Spam in Commercial Review Websites

by

Huayi Li
B.E., Computer Science and Technology, Nanjing Normal University, 2009

THESIS

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Chicago, 2016

Chicago, Illinois

Defense Committee:
Bing Liu, Chair and Advisor
Ugo A. Buy
Chris Kanich
Philip S. Yu
Sherry L. Emery, Institute for Health Research and Policy

This thesis is dedicated to my wife Yu Ding and my son Carlos D. Li.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my Ph.D. advisor, Professor Bing Liu, for his mentoring and advising without any reservation. Working with him to pursue my Ph.D. degree has been my greatest honor. From him I learned valuable research experience and skills for my future work and study. I sincerely appreciate his encouragement and patience along the way. His continuous effort in shaping me into the researcher I am today was essential to the success of my Ph.D. study. Were it not for him, this dissertation would not have been possible.

I also want to thank Professor Philip S. Yu, Professor Ugo Buy, Professor Chris Kanich, and Professor Sherry Emery for taking their valuable time to serve on my dissertation committee. Their constructive advice and suggestions contributed greatly to this thesis.

During my Ph.D. study, I made friends with many colleagues and visiting scholars who helped me through discussions and collaborations: Zhiyuan Chen, Arjun Mukherjee, Jidong Shao, Jianfeng Si, Geli Fei, Shuai Wang, Nianzu Ma, Yueshen Xu, Weixiang Shao, Xiaokai Wei, and others.

Last but not least, I owe many thanks to my parents for their consistent support. I am truly indebted to my wife Yu for her unconditional trust and love. The happiness and hardships we went through together are an indispensable part of my Ph.D. study.

HL

CONTRIBUTION OF AUTHORS

Chapter 3 presents a published manuscript (Li et al., 2015) for which I was the primary author. Zhiyuan Chen, Arjun Mukherjee, and my advisor Professor Bing Liu contributed to discussions about the preliminary ideas and assisted in revising the manuscript. Jidong Shao is the manager of Dianping's spam detection team, who kindly shared their data with us and gave us many insights into their fake review filtering system.

Chapter 4 presents a published manuscript (Li et al., 2014a) for which I was the primary author. My advisor Professor Bing Liu and Arjun Mukherjee contributed to the idea of traditional PU learning. Xiaokai Wei and Zhiyuan Chen shared their opinions and suggestions on collective classification. Jidong Shao contributed to revising the manuscript.

Chapter 5 presents a published manuscript (Li et al., 2014c) for which I was the primary author. Arjun Mukherjee and my advisor Professor Bing Liu contributed to the discussion about the preliminary ideas and assisted in revising the manuscript. Rachel Kornfield and Sherry Emery provided their dataset and labeled the training data for our model.

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Temporal And Spatial Patterns of Opinion Spam
  1.2 Detecting Opinion Spammers using Positive Unlabeled Learning
  1.3 Detecting Campaign Promoters on Twitter
2 RELATED WORKS
  2.1 Classification and Ranking
  2.2 Network based approach
  2.3 Bursty Reviews
  2.4 Spammer Groups Detection
  2.5 Campaign Promoter Detection
3 TEMPORAL AND SPATIAL PATTERNS OF OPINION SPAM
  3.1 Introduction
  3.2 Dataset
    3.2.1 Dataset Statistics
    3.2.2 Schema of Review Entities
  3.3 Opinion Spam Analysis
    3.3.1 Meta-data Patterns
    3.3.2 Temporal Patterns
    3.3.3 Spatial Patterns
    3.3.4 Temporal and Spatial Patterns
    3.3.5 Network Patterns
    3.3.6 Rating Patterns
  3.4 Opinion Spam Detection
4 DETECTING OPINION SPAMMERS USING POSITIVE AND UNLABELED LEARNING
  4.1 Introduction
  4.2 Problem Definition
  4.3 Collective Classification Models on Heterogeneous Networks
    4.3.1 Collective Classification on reviews
    4.3.2 Multi-typed Heterogeneous Collective Classification
    4.3.3 Collective Positive-Unlabeled Learning Model
  4.4 Experiment
    4.4.1 Datasets and Language Characteristics
    4.4.2 Experiment Settings

TABLE OF CONTENTS (Continued)

    4.4.3 Training and Testing
    4.4.4 Evaluation Metrics
    4.4.5 Compared Methods
    4.4.6 Model Comparison
    4.4.7 Posterior Analysis
5 DETECTING CAMPAIGN PROMOTERS ON TWITTER
  5.1 Introduction
  5.2 Promoter Detection Model
    5.2.1 Markov Random Fields
    5.2.2 Loopy Belief Propagation
    5.2.3 T-MRF
  5.3 T-MRF For Promoter Detection
    5.3.1 Node Types
    5.3.2 Edge Potentials
    5.3.3 Node Potentials: Prior Belief Estimation
    5.3.4 Overall Algorithm
  5.4 Experiments
    5.4.1 Datasets and Settings
    5.4.2 Results
6 CONCLUSIONS
APPENDICES
CITED LITERATURE
VITA

LIST OF TABLES

I    Entity schema
II   Proposed Features
III  Results based on 10-fold cross validation
IV   States of different entities. +, -, u are abbreviations for states
V    Important Notations
VI   Statistics of the 500 restaurants in Shanghai
VII  Confusion matrix: Positive is the fake review, Negative is the unlabeled review. TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative
VIII Disagreement counts of a pair of models (A/B). The disagreement counts are based on all the nodes of a connected component in the entire network
IX   Important Notations
X    Propagation matrix ψ_{i,j}(σ_i, σ_j | t_i, t_j) for each type of edge potentials
XI   Data statistics
XII  AUC (Area Under the Curve) for each dataset, each system, and different ε values

LIST OF FIGURES

1  The number of reviews v.s. the number of different entities
2  Site registration distribution of users
3  Client distribution of reviews
4  Review posting pattern of days of week
5  User registration pattern of days of week
6  Normalized time-series of the number of spammers and non-spammers registered (14-day Simple Moving Average)
7  Geographical distribution of users and IPs in major cities of China
8  Histogram of IPs at various distances to Shanghai
9  Abnormal patterns measured by ATS
10 CDF of users and IPs v.s. number of other entities
11 Segment of the user-cookie network. Cookies (black), spammers (red), non-spammers (blue)
12 Rating distribution of fake and truthful reviews
13 Rating of user fake singleton reviews v.s. other fake reviews
14 Histogram of absolute average rating deviation of fake reviews from genuine reviews for restaurants
15 Sample network of the users, reviews and IP addresses
16 Evaluation of different models based on 10-fold cross validation. Note: although Spy-EM has the highest recall, it over-predicts the positive class and hence has the lowest precision
17 F1-score of all models given various training sizes
18 CDF of the number of fake reviews for suspicious IPs v.s. organic IPs
19 CDF of the number of spammers for suspicious IPs v.s. organic IPs
20 Disagreement of HCC and CPU on a small segment of one of the connected components of the entire heterogeneous network. Blue nodes are IPs, red nodes are users, and black ones are reviews. Labels of IPs are sequence numbers, labels of users are left empty, and labels of reviews are the short names listed in Table VI; otherwise they are set to empty
21 Disagreement of MHCC and CPU on a small segment of one of the connected components of the entire heterogeneous network. Blue nodes are IPs, red nodes are users, and black ones are reviews. Labels of IPs are sequence numbers, labels of users are left empty, and labels of reviews are the short names listed in Table VI; otherwise they are set to empty
22 Burstiness of CDC 2012 campaign dataset
23 A simple example of User-URL-Burst network

SUMMARY

Review websites have become very important platforms for consumers to compare and evaluate products or services. However, review systems are often attacked by opinion spam. Even though opinion spam (fake review) detection has attracted significant research attention in recent years, the problem is far from solved. One key reason is that no large-scale ground truth datasets are available for model building and evaluation; most existing approaches use pseudo fake labels rather than real fake labels for reviews in the commercial setting. In recent years, review hosting sites such as Yelp and Dianping have built robust fake review filtering systems to ensure the quality of their reviews. Our work is facilitated by the commercial spam filter of Dianping. Collaborating with Dianping's data scientists, we present the first large-scale analysis of restaurant reviews filtered by their fake review detection system. Rather than focusing on the linguistic and behavioral features of individual users, as other studies do, we propose several novel temporal and spatial features (Li et al., 2015) which demonstrate fundamental differences between spammers and non-spammers. We leverage these features to substantially improve existing supervised opinion spam detection models. Furthermore, we found that Dianping's algorithm has very high precision but unknown recall, which means that the filtered reviews are fake with high confidence, but it is not necessarily true that the unfiltered reviews are all genuine. Thus, it is more appropriate to treat unfiltered reviews as unlabeled instances. We therefore propose a novel collective classification algorithm (Li et al., 2014a) that leverages the strong correlations among reviews, reviewers, and IPs to detect opinion spam.

Finally, we found that spammers often work in groups. A spam group or spam community is more damaging because its members can together launch a spam attack in stealth mode. Each individual reviewer may not look suspicious, but a bigger picture of all of them sheds light on the collusive behaviors of a spam community. Identifying such communities is therefore important. We propose to model users' temporal activities with a Hidden Markov Model and then build a co-burst network, which is much more effective at capturing spammers' collusive behaviors than the traditional reviewer-product network used in most existing works.

CHAPTER 1

INTRODUCTION

The Internet has made it much easier for people to express their opinions and share their thoughts. Nowadays, people from anywhere in the world can post reviews of products and services to share their views and opinions. Opinions in reviews are also increasingly used by consumers to make decisions and improve products/services. Positive opinions can directly translate into profits for businesses, so many imposters are now gaming the system by posting fake reviews to promote or demote target businesses (Jindal and Liu, 2008). As a large part of society relies on social opinions for decision making, analyzing and detecting opinion spam is very important to ensure the trustworthiness of online reviews. Otherwise, online social media could become a place full of lies and deceptions.

Given the critical issue of opinion fraud in online communities, how can one identify fake reviews and attribute the responsible culprits behind them? In this thesis, we present several research problems whose solutions address this emergent, prevalent, and socially impactful problem. Our work covers four key problems in the study of detecting opinion spam: leveraging temporal and spatial patterns of spammers, detecting spammers via positive and unlabeled learning, detecting spammer groups using their collective co-bursting activities, and detecting campaign promoters.

1.1 Temporal And Spatial Patterns of Opinion Spam

Partnering with Dianping, we present the first large-scale analysis of restaurant reviews filtered by Dianping's fake review detection/filtering system. Unlike other academic datasets, which were crawled from review hosting sites, our dataset was shared by Dianping directly, and Dianping's commercial review filtering system produces a label for each review indicating whether it is trustworthy or fake. Besides, the dataset is very large, containing over 6 million reviews of all restaurants in Shanghai. The large size and the complete coverage of all restaurants allow us to compute reliable statistics of the dynamics of opinion spamming. To the best of our knowledge, no existing study has been performed on such a large scale.

Compared with other datasets, reviews in our dataset come with a much richer context, including users' IP addresses and cookies, users' profiles (user registration information), and restaurants' meta-data (category, geolocation), which have never been used in any published opinion spam research. These additional data allow us to create more useful features for building machine learning models to spot review spammers. Although Dianping's filter may not be perfect, the patterns discovered by our analyses bolster the classifier's reliability.

Apart from the analysis of this large-scale review dataset, we also propose several novel temporal, spatial, and other features which demonstrate fundamental differences between spammers and non-spammers. We then leverage these features (Li et al., 2015) to improve supervised opinion spam detection. The proposed features significantly outperform existing state-of-the-art linguistic and behavioral features.

Beyond the standard behavioral features used in existing works, our work is the first to give comprehensive insights into temporal and spatial features at various levels (reviews, users, IPs, and cookies). All our proposed features are also domain and language independent; they are thus applicable to reviews in other domains and other languages. The features and patterns that we propose can help build markedly more accurate classification models.

1.2 Detecting Opinion Spammers using Positive Unlabeled Learning

Dianping is confident that reviews detected by their algorithm are identified with high precision, but the recall is hard to know. This means that the filtered reviews from the system are very pure, but it is not necessarily true that the unfiltered reviews are all genuine. Thus, it is more appropriate to treat unfiltered reviews as an unlabeled set. Such problems call for a model that learns from positive and unlabeled examples (PU learning). To the best of our knowledge, this approach has rarely been used in fake review detection, and it is in fact more appropriate for the problem. By leveraging the strong correlations between reviews, users, and IP addresses, we propose a novel collective classification algorithm named Multi-typed Heterogeneous Collective Classification (MHCC) and then extend it to Collective Positive and Unlabeled learning (CPU) (Li et al., 2014a). Experimental results show that our proposed models can markedly improve the performance of strong baselines in both PU and non-PU learning settings. Since our proposed features are domain and language independent, our models can be easily applied to reviews in other domains and other languages.

1.3 Detecting Campaign Promoters on Twitter

In addition to Dianping, we also collaborated with the Health Media Collaboratory at the University of Illinois at Chicago. Social media such as Twitter is now widely used for marketing and for predicting stock markets (Si et al., 2013; Si et al., 2014; Harsley et al., 2016). Their team kindly shared with us the Twitter datasets of several stop-smoking campaigns. As social media becomes an important source of public information, businesses are increasingly using it as a platform to promote their products or services. Promoters often influence people's behaviors in a hidden fashion that readers are not aware of. It is thus critical to identify such campaigns and their promoters and to understand how they operate. However, so far not much work has been done in this specific field. In this thesis, we study this problem on social media using Twitter as an example. We characterize the problem as relational classification and apply Markov Random Fields to solve it. Our experiments on three real-life stop-smoking datasets show that our proposed method performs very well compared with baseline methods.

CHAPTER 2

RELATED WORKS

2.1 Classification and Ranking

(Ott et al., 2011) built supervised learning models using unigrams and bigrams, and (Mukherjee et al., 2013) added many behavioral features to improve on them. Further, Positive Unlabeled Learning has been shown to be very useful for spam detection (Li et al., 2014b; Ren et al., 2014). Ranking-based approaches are also used in spam detection: some researchers constructed a heterogeneous network of users/reviews and products and then employed a HITS-like ranking algorithm (Wang et al., 2011), while others applied Loopy Belief Propagation (Fei et al., 2013; Akoglu et al., 2013; Rayana and Akoglu, 2015) to rank reviews or reviewers.

2.2 Network based approach

In the past few years, many researchers have incorporated network relations into opinion spam detection. Most of them constructed a heterogeneous network of users/reviews and products. Loopy Belief Propagation (Fei et al., 2013; Akoglu et al., 2013; Rayana and Akoglu, 2015) and collective classification (Li et al., 2014a; Xu et al., 2013) are very effective on this problem because they are well-known relational classifiers. In this work, we propose to build a network using co-bursting relations, which is shown to be more effective in capturing the intricate relations between spammers, especially for raised accounts. We argue that the above works can largely benefit from co-bursting.

2.3 Bursty Reviews

Bursty reviews have recently been studied by many researchers. (Fei et al., 2013) studied the review time-series of individual products. They assumed that reviewers in a review burst of a product are often related, in the sense that spammers tend to work with other spammers and genuine reviewers with other genuine reviewers. They thus applied Loopy Belief Propagation to rank spammers using heuristic propagation parameters. Similarly, (Xie et al., 2012) analyzed multiple time-series of a single retailer, including the daily number of reviews, the average rating, and the ratio of singleton reviews; their end task is to find the time intervals in which a spam attack happens to a retailer. In contrast, our work focuses on studying individual reviewers' review time-series, and its goal is to identify individual spammers. Other researchers applied various Bayesian approaches to detect anomalies in rating time-series (Günnemann et al., 2014a; Günnemann et al., 2014b; Hooi et al., 2015). However, our model only requires the timing of each review, and a byproduct of our model can be used to detect spammer groups effectively. Incorporating rating signals into our model will be part of our future work.

2.4 Spammer Groups Detection

Although recent progress has been made in uncovering spam groups (Mukherjee et al., 2012; Xu et al., 2013; Ye and Akoglu, 2015), evaluating spammer groups has been troublesome because ground truth for large-scale datasets was unavailable. Thanks to the labels from Dianping's system, for the first time, evaluating the clustering results of our co-burst network against the co-review network is made possible.

2.5 Campaign Promoter Detection

The problem of detecting promoters on Twitter is closely related to the detection of Twitter spam. Benevenuto et al. (Benevenuto et al., 2010) studied the problem of identifying Twitter spammers. They manually labeled a large collection of users, from which they trained a traditional classifier using both tweet content and user behavior features. We also incorporate content and behavior features into the local classifier of our model; in our case, the local classifier is only used to produce the prior probabilities for each user node. Grier et al. (Grier et al., 2010) did an interesting analysis of unethical uses of Twitter. They showed that 8% of URLs in tweets point to phishing, malware, and scam sites listed in popular URL blacklists, and that Twitter is an effective platform for coercing users to visit targeted webpages, with a click-through rate of 0.13%. Even though the URLs promoted in a campaign are not necessarily harmful, their work indicates a close relationship between Twitter users and URLs; however, it does not detect promoters. Several other researchers have also provided detailed analyses of Twitter spam accounts. Thomas et al. (Thomas et al., 2011) studied the underlying infrastructure of the spam marketplace and identified a few spam companies. Although some promoters behave in a similar way to spammers, a large number of promoters participate in a campaign legitimately, especially in non-profit campaigns; two of our datasets belong to this category. Thus, the criteria and techniques used in Twitter spam detection cannot be directly applied to campaign promoter detection.

Campaign detection has been a popular research topic for years. (Gao et al., 2010) conducted experiments on Facebook messages and applied a clustering algorithm to cluster those messages.

(Lee et al., 2011) extended the work and provided three different approaches to extract campaigns from message graphs. (Zhang et al., 2012) instead constructed a graph of user accounts and extracted dense sub-graphs as campaigns. However, our work is clearly different from theirs in that our goal is to perform user-level classification to detect individual promoters; moreover, promoters may not even be connected in the graph. Benevenuto et al. (Benevenuto et al., 2009) built a traditional classifier to solve the promoter detection problem for YouTube users. As their study is on YouTube, their features derived from video attributes (such as video duration, number of views and comments, and so on) are not directly applicable to our problem. However, we adapt their approach to our context and use it both as a baseline and as the local classifier in our evaluation. Since their approach did not incorporate the rich context of networks and the relational information of users, its classification results are markedly poorer than those of our proposed T-MRF method.

CHAPTER 3

TEMPORAL AND SPATIAL PATTERNS OF OPINION SPAM

(This chapter includes and expands on my paper previously published as Huayi Li, Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Jidong Shao. Analyzing and Detecting Opinion Spam on a Large-scale Dataset via Temporal and Spatial Patterns. In ICWSM 2015.)

3.1 Introduction

The Internet has made it much easier for people to express their opinions and share their thoughts. Nowadays, people from anywhere in the world can post reviews of products and services to share their views and opinions. Opinions in reviews are also increasingly used by consumers to make decisions and to improve products/services. Positive opinions can directly translate into profits for businesses, so many imposters are now gaming the system by posting fake reviews to promote or demote target businesses (Jindal and Liu, 2008). As a large part of society relies on social opinions for decision making, analyzing and detecting opinion spam is very important to ensure the trustworthiness of online reviews. Otherwise, online social media could become a place full of lies and deceptions.

Despite the prevalence of opinion spam, existing methods are not keeping pace, largely because of the unavailability of large-scale ground truth datasets from real-world commercial settings, which impedes research on opinion spam detection. Existing work typically relies on pseudo fake labels rather than real fake labels. For example, Jindal and Liu (2008) simply treated duplicate and near-duplicate Amazon product reviews as fake reviews.

(Ott et al., 2011) hired Amazon Mechanical Turkers to write fake hotel reviews. The review dataset they compiled had only 800 reviews, which is too small to support reliable statistical analysis. Furthermore, the motivations and psychological states of mind of hired Turkers and of professional spammers in the real world can be quite different, as shown by the results in (Mukherjee et al., 2013).

Companies such as Dianping and Yelp have developed effective fake review filtering systems against opinion spam. (Mukherjee et al., 2013) reported the first analysis of Yelp's filter, based on reviews of a small number of hotels and restaurants in Chicago. Their work showed that behavioral features of reviewers and their reviews are strong indicators of spamming. However, the reviews they used were not provided by Yelp but crawled from Yelp's business pages. Due to the difficulty of crawling and Yelp's crawling rate limit, they obtained only a small set of about 64,000 reviews.

In this chapter, we study a large-scale real-life restaurant review dataset with fake review labels provided by Dianping's spam detection algorithm. Our work is demarcated from all previous works along the following dimensions:

Data Volume: Unlike the aforementioned datasets, which were crawled from review hosting sites, our dataset was shared by Dianping directly. It is very large, containing over 6 million reviews of all restaurants in Shanghai.

Due to the large size and the complete coverage of all restaurants, it allows us to compute reliable statistics of the dynamics of opinion spamming. To the best of our knowledge, no existing study has been performed on such a large scale.

Data Richness: Compared with other datasets, reviews in our dataset come with a much richer context, including users' IP addresses and cookies, users' profiles (user registration information), and restaurants' meta-data (category, geolocation), which have never been used in any published opinion spam research. These additional data attributes allow us to create more useful features for building machine learning models to spot review spammers.

Label Accuracy: Dianping's commercial review filtering system produces a label for each review indicating whether it is trustworthy or fake. These labels are reliable for two reasons. First, Dianping has a team of anti-spam experts who manually evaluate their filter's performance on a weekly basis. Second, a notification message is sent to the reviewer if a review is classified as fake; the low rate of mis-classified fake reviews based on users' feedback indicates the effectiveness of the Dianping system. This means that the precision of Dianping's filter is very high, although it is not clear what the recall is.

Feature Novelty: Beyond the standard behavioral features used in existing work, we are the first to give comprehensive insights into temporal and spatial features at various levels (reviews, users, IPs, and cookies). The features and patterns that we propose in this chapter can help build markedly more accurate classification models.

So, what are the most discriminative features between spammers and non-spammers? How do fake reviews' ratings deviate from those of genuine reviews? How widespread and intensive is opinion spamming, and how much damage can it cause? How do spammers behave differently from non-spammers, temporally and spatially? And how can spammers be detected effectively? These are the important questions that this chapter strives to answer in the following sections. We also leverage the proposed novel features for spammer detection on the real-world fake review dataset coded with industrial filtering. Experimental results show that the proposed features significantly improve detection accuracies beyond existing state-of-the-art methods.

3.2 Dataset

The review dataset used in this chapter is from Dianping.com. Dianping was founded in 2003 in Shanghai and is of similar scale to Yelp. It now has 190 million monthly active users and over 60 million reviews as of the fourth quarter of 2015. Before discussing the analysis and detection of opinion spam, we first describe the data and its schema.

3.2.1 Dataset Statistics

Our reviews, retrieved from Dianping's data warehouse, consist of all reviews of all restaurants in Shanghai from November 1st, 2011 to April 18th, 2014. The dataset is not only much larger than the review datasets used in existing studies but also contains class labels produced by Dianping's fake review filter. We further infer the class labels of users and IP addresses by considering the majority class of all their reviews. That is, users/IPs are considered spam users/IPs if more than 50% of their reviews are filtered by Dianping (i.e., fake). However, unlike users and IPs, the percentage of fake reviews per restaurant is distributed fairly uniformly, making the label of a restaurant hard to infer; this also shows that faking is widespread across almost all restaurants.
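To make the majority-class rule concrete, here is a minimal sketch of how user and IP labels could be derived from review-level labels with pandas; the column names (user_id, ip, label) follow Table I, but the dataframe itself is a hypothetical stand-in for Dianping's data.

```python
import pandas as pd

# Hypothetical stand-in for the review table of Table I;
# `label` is 1 if Dianping's filter marked the review as fake, else 0.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "ip":      ["a",  "b",  "b",  "b",  "c"],
    "label":   [1,    1,    0,    0,    1],
})

def majority_labels(df: pd.DataFrame, key: str) -> pd.Series:
    """Label an entity as spam (True) if more than 50% of its reviews are fake."""
    return df.groupby(key)["label"].mean() > 0.5

user_is_spammer = majority_labels(reviews, "user_id")  # u1 -> True, u2 -> False
ip_is_suspicious = majority_labels(reviews, "ip")      # a -> True, b -> False, c -> True
print(user_is_spammer.to_dict(), ip_is_suspicious.to_dict())
```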

The large number of reviews, users, IPs, and restaurants enables us to conduct a wide range of analyses that have never been done before. For simplicity, we use "spam" to represent fake reviews, "spammers" to refer to users who write fake reviews, and "spam IPs" to represent IPs with a majority of fake reviews; correspondingly, we use "non-spam" to represent truthful reviews, and "authentic users" and "organic IPs" to represent users and IPs with less than one half of their reviews fake, respectively. It can be seen from the table that about one fourth of the reviews are identified as fake. If it were not for the filter, fake reviews would cause serious damage to the fair market in Dianping. It is worthwhile to note that the percentage of spammers is lower than that of fake reviews, echoing the fact that a spammer tends to write many fake reviews.

Figure 1 shows the plot of the number of reviews versus the number of different entities (restaurants, users, and IPs) in log scale. From Figure 1(a), we can see that most restaurants have only 10 or fewer reviews, and for these restaurants there are many more fake reviews than truthful ones. This finding makes sense, as many less well-known restaurants need some fake/deceptive reviews to bootstrap (initially promote) their businesses. In contrast, restaurants with a large number of reviews are usually popular and do not have to fabricate fake reviews, as they are already well accepted by consumers. Consequently, fake reviews for restaurants with very few reviews are more harmful.

Figure 1. The number of reviews v.s. the number of different entities: (a) restaurants, (b) users, (c) IPs

From Figure 1(b), we can see that, unlike for restaurants, the number of organic users exceeds the number of spammers at any number of reviews. Reviews of non-spammers generally follow a power-law distribution. However, it is surprising that spammers who write fewer than 40 reviews exhibit a different decay rate than spammers who write more fake reviews. The elbow shape in the log-log plot seems to indicate two kinds of spammers: (a) a large number of lightweight (naïve) spammers v.s. (b) far fewer heavyweight (expert) spammers. The IPs in Figure 1(c) reveal a similar trend: the majority of IPs are non-spam IPs.

3.2.2 Schema of Review Entities

In addition to the data volume, entities in our dataset contain many new attributes (listed in Table I) that have not been explored so far in the literature. The prominent attributes of interest in this study are the IPs and cookies at user registration, the sites through which users registered, as well as the IPs and cookies users use when posting their reviews.

TABLE I. Entity schema

Restaurant:
- rest_id: restaurant id
- name: restaurant name
- reg_date: date of registration
- lat: latitude of the restaurant
- lng: longitude of the restaurant
- cat: food category

User:
- user_id: user id
- name: user name
- reg_date: date of registration
- reg_ip: IP at registration
- reg_cookie: cookie at registration
- reg_site: site where the user registered

Review:
- review_id: review id
- rest_id: restaurant id
- user_id: user id
- label: label of the review generated by the fake review filter (spam or non-spam)
- datetime: date and time when the review was submitted
- rating: review rating, ranging from 1 to 5
- client: device used when writing the review (e.g., PC, Android, iPhone)
- ip: IP of the user's device
- cookie: cookie in the user's browser
- text: review content

Note that the IP and cookie a user used when writing a review are not necessarily the same as those used at registration. We further map all the IPs in our database to their city-level locations via the IP2Location API by Datatang. The locations of IPs enable spatial analysis of spammers' behaviors.

3.3 Opinion Spam Analysis

In this section, we analyze various patterns that distinguish spammers from non-spammers. The characteristics of opinion spam we found can be used as features to detect, and even prevent, opinion spam by increasing the cost of spamming.

Figure 2. Site registration distribution of users

3.3.1 Meta-data Patterns

The first interesting finding we observe is the spammers' registration pattern. Dianping uses different site names to categorize the portals through which users register. "Main site" means the main website, where most users are registered.

The "group-buy website" is a subordinate service of Dianping, an equivalent of Groupon. Other portals include mobile apps, WAP, and various third-party entries. Figure 2 shows the distribution of spammers and non-spammers across registration portals; the percentages are computed within each class. Since registering accounts through Dianping's main site is the fastest and most convenient way, spammers show a preference for registering on the main site. Another benefit of registering on the main site is that spammers can quickly start writing fake reviews once registered. Consequently, the percentage of reviews with fake review labels posted from Dianping's main site is the highest among all clients (Figure 3).

Figure 3. Client distribution of reviews

3.3.2 Temporal Patterns

We now present some longitudinal studies along the time dimension.

Review posting time: Although we did not find any interesting patterns in the time-series of fake reviews versus authentic reviews, we spot noteworthy review posting patterns, as shown in Figure 4. Spammers are more active on weekdays (except Mondays) and less active on weekends compared to organic users. This suggests that many spammers may be part-time workers who are busy on Mondays with their own work and occasionally write fake reviews on other weekdays. On the contrary, non-spammers, who write authentic reviews based on their real personal experiences, are more likely to post reviews on Sundays and Mondays after returning from the dinner parties or hangouts that happen over weekends.

Figure 4. Review posting pattern of days of week
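A minimal sketch of how such a day-of-week comparison could be computed, assuming a pandas dataframe with the datetime and label columns of Table I; the normalization within each class mirrors Figure 4:

```python
import pandas as pd

# Hypothetical reviews table with the `datetime` and `label` columns of Table I.
reviews = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2013-03-04 12:01", "2013-03-05 19:30", "2013-03-10 20:15",
        "2013-03-11 13:45", "2013-03-12 09:20", "2013-03-17 18:05",
    ]),
    "label": ["spam", "spam", "non-spam", "non-spam", "spam", "non-spam"],
})

# Fraction of each class's reviews posted on each day of week (Mon=0 .. Sun=6);
# each row sums to 1, as in Figure 4.
day_of_week = reviews["datetime"].dt.dayofweek
posting_pattern = pd.crosstab(reviews["label"], day_of_week, normalize="index")
print(posting_pattern)
```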

Figure 5. User registration pattern of days of week

User registration time: A similar study of the number of users registered on each day of the week reveals the same pattern, as shown in Figure 5, supporting the conclusions drawn from Figure 4. We also conducted a similar longitudinal study of users' registrations and discovered the different bursty natures of the time-series plotted in Figure 6. The variance of the spammers' time-series is clearly larger than the non-spammers'. The Chinese New Years (based on the lunar calendar) account for the sudden drops in both time-series; however, there are more valleys and peaks in the number of spammers, which indicates the anomaly of spammers' registrations. Further, we investigated the spammers' accounts in the peaks and found that a fair number of them share the same cookie within a short period of time. More interestingly, some of the spammers with the same cookie registered with emails from the same domain, and some of their user names even share the same prefixes/suffixes.

Figure 6. Normalized time-series of the number of spammers and non-spammers registered (14-day Simple Moving Average)

Thus, ranking users by abnormal registration behaviors can lower the effect of reviews composed by those users. With a carefully designed early spam detection module, damages caused by fake reviews can be prevented, or at least significantly reduced.

3.3.3 Spatial Patterns

An interesting question we investigate here is: in order to maximize the profits from writing fake reviews, do spammers work for restaurants in cities beyond where they reside? In other words, is opinion spamming geographically outsourced? To answer this, we analyze the spatial patterns of spammers. Our dataset does not allow us to visualize the complete geographical distribution of the IPs of all reviews from a given user, because the user may have reviews for restaurants outside Shanghai; we therefore focus on the geographical distributions of reviews' IPs with respect to reviews in Shanghai, as well as the distributions of the IPs users used to register.

Figure 7. Geographical distribution of users and IPs in major cities of China. (a) Distribution of spammers v.s. non-spammers by registration IPs. (b) Distribution of malicious IPs v.s. organic IPs in reviews.

First, we would like to see how spammers and non-spammers are distributed across the major cities of China, assuming the city where a user registered is the city where the user lives. We map the IPs that users used at registration to city-level coordinates and use geo-tagged pie charts for visualization, as illustrated in Figure 7(a). The size of a pie chart represents the total number of users mapped to the city, and its color portions reflect the ratio of spammers to non-spammers. Because most users are registered in Shanghai, we exclude them from this study, as we want to focus on users outside Shanghai. There are two observations from this chart: (1) People in large cities (the few biggest charts) are dominated by non-spammers. This makes sense because people in large cities have higher salaries and are likely to travel to Shanghai for vacation or business purposes, so their reviews are likely to be authentic, written from their own experiences.

(2) The further a city is from Shanghai, the higher its ratio of spammers. A possible explanation is the travel cost: as the distance increases, the chance of people traveling to Shanghai drops, and this is especially true for the underdeveloped cities in the western part of China, where the profits from writing spam are reasonably attractive given the local average income. We can say that opinion spamming exhibits geographical outsourcing.

Secondly, we notice that spammers may register multiple accounts to spam in groups. We aggregate the users by IPs and visualize malicious IPs versus organic IPs in Figure 7(b). This second plot again confirms our previous conjecture about geographical outsourcing. To better interpret the observation that a city's malicious IP ratio increases with its distance to Shanghai, we convert the 2D map representation to a 1D distance measure. We use a side-by-side histogram in Figure 8 to represent the malicious IP ratio as a function of distance to Shanghai. It can easily be seen from the chart that IPs in cities 200+ miles away from Shanghai are mostly malicious.

Figure 8. Histogram of IPs at various distances to Shanghai
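As an illustration of the 2D-to-1D conversion, the sketch below computes great-circle distances from each IP's city coordinates to Shanghai using the haversine formula; this is a simplification, since the chapter uses the Vincenty distance, and the IP-to-coordinate lookups here are hypothetical stand-ins for the IP2Location step.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8
SHANGHAI = (31.23, 121.47)  # (latitude, longitude)

def haversine_miles(a: tuple, b: tuple) -> float:
    """Great-circle distance in miles between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

# Hypothetical city coordinates resolved from IPs via IP2Location.
ip_city_coords = {"1.2.3.4": (39.90, 116.40),   # Beijing
                  "5.6.7.8": (30.57, 104.07)}   # Chengdu

dist_to_shanghai = {ip: haversine_miles(c, SHANGHAI) for ip, c in ip_city_coords.items()}
print(dist_to_shanghai)  # each IP is then binned by this distance to build Figure 8
```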

3.3.4 Temporal and Spatial Patterns

In addition to the individual temporal and spatial patterns found above, there are more novel dynamics to explore when we combine the spatial and temporal dimensions. Following the hypothesis that some professional spammers frequently change IP addresses to register many accounts in a short period of time, we postulate that such spammers would also change IP addresses frequently when posting reviews, to somehow fool Dianping's fake review filtering system. We thus propose a novel metric to quantify the abnormal behaviors of such spammers, which we call the Average Travel Speed (ATS) measure.

We define S_u as the sequence of reviews of a reviewer u, ordered by posting time-stamp. Each review r_i consists of two primary attributes: an IP address and a time-stamp. By looking up all the IPs via the IP2Location API, we tag each IP with the coordinates of the city where it is located. Only 3.2% of IPs are not found in the IP2Location database; we remove the reviews pertaining to those IPs for each user. The ATS measure aims to simulate the traveling sequence of a user: it averages the speed (in miles per second) of a user moving from one location to the next in the sequence. The rationale is that users who frequently and randomly move all over China at unusual speeds are highly likely to be spammers. The formal definition of ATS is given in Equations 3.1 and 3.2.

The function distance takes two geo-coordinates and returns the Vincenty distance between the two points on earth. The ATS of users with only one review is set to zero, since ATS requires at least two reviews. Note that the IPs spammers use can be proxy IPs rather than the actual IPs of their end devices; the ATS measure can thus also spot the abnormal behavior of frequent switching between IPs that are far apart.

S_u = \langle r_1, r_2, \ldots, r_{|S_u|} \rangle, \quad \text{where } r_i = (t, IP, loc) \text{ and } r_i.t < r_j.t \text{ for } i < j \tag{3.1}

ATS_u = \frac{1}{|S_u| - 1} \sum_{i=2}^{|S_u|} \frac{\mathrm{distance}(r_i.loc,\, r_{i-1}.loc)}{r_i.t - r_{i-1}.t} \tag{3.2}

Figure 9. Abnormal patterns measured by ATS
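A small sketch of how ATS (Equation 3.2) might be computed for one user, with a haversine helper standing in for the Vincenty distance; the review tuples (timestamp, lat, lng) are hypothetical:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles; a stand-in for the Vincenty distance."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

def average_travel_speed(reviews):
    """ATS per Equation 3.2: mean speed (miles/second) over consecutive reviews.

    `reviews` is a list of (unix_timestamp, lat, lng) sorted by time.
    Returns 0.0 for users with fewer than two reviews, as in the chapter.
    """
    if len(reviews) < 2:
        return 0.0
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(reviews, reviews[1:]):
        dist = haversine_miles(la0, lo0, la1, lo1)
        speeds.append(dist / max(t1 - t0, 1))  # guard against zero time gaps
    return sum(speeds) / len(speeds)

# Hypothetical user posting from Shanghai, then Chengdu one hour later.
print(average_travel_speed([(0, 31.23, 121.47), (3600, 30.57, 104.07)]))
```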

There are two caveats to this analysis. First, we only have the complete set of reviews of restaurants in Shanghai between November 1st, 2011 and April 18th, 2014; reviews by these users for restaurants outside Shanghai are not counted. Second, the city locations of IPs retrieved from the IP2Location database may not reflect the correct city locations of the IPs as of the time the reviews were posted. In spite of these issues, we found many users whose ATS is exceptionally high, as can be seen from Figure 9. Most users are stationary and barely change IPs or city locations. It is also noteworthy that the majority of the users with unusually fast mobility rates are filtered by Dianping, showing that novel spatio-temporal dynamics such as the average travel speed can be useful in spam detection.

3.3.5 Network Patterns

Recent advances in spam detection have also involved applications of heterogeneous networks by means of collective classification (Li et al., 2014a) and belief propagation (Wang et al., 2011; Akoglu et al., 2013). However, due to the lack of information such as IPs and cookies, most of those studies primarily examined relations between users and restaurants/businesses. Thus, we decided to delve into the correlations among the different entities in our review dataset. Hypothesizing that spammers switch IP addresses/cities and even browser cookies more often than ordinary/genuine users, we plot the cumulative distribution function (CDF) of the number of spammers and non-spammers as a function of the number of IPs (Figure 10 a), cookies (Figure 10 b), and cities (Figure 10 c), respectively. For all three charts, the CDF curves of spammers are always below those of non-spammers, which means a sizable number of professional spammers are inclined to use different IPs and cookies to prevent themselves from being detected by the review rate limit per IP or cookie.

Figure 10. CDF of users and IPs v.s. number of other entities: (a) users v.s. # of IPs, (b) users v.s. # of cookies, (c) users v.s. # of cities, (d) IPs v.s. # of cookies

In other words, naive spammers who intensively post reviews in a short period of time (a high review rate) to promote or demote restaurants can be easily detected and disabled. This is a possible explanation for the elbow shape of the log-log plot in Figure 1(b). Since spammers are much fewer than non-spammers, if the average number of cookies used by spammers were approximately the same as that used by non-spammers, cookies associated with non-spam IPs would outnumber cookies associated with spam IPs. However, in Figure 10(d), we still observe more spam IPs with a large number of cookies than non-spam ones.
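The per-class CDFs in Figure 10 can be reproduced from per-user counts of distinct IPs (or cookies, or cities); below is a minimal sketch with numpy and pandas over a hypothetical reviews table:

```python
import numpy as np
import pandas as pd

# Hypothetical reviews with per-review IPs and a user-level class label.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "ip":      ["a",  "b",  "c",  "a",  "a",  "d"],
    "is_spammer": [True, True, True, False, False, False],
})

def ecdf(values):
    """Empirical CDF: sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

uips = reviews.groupby("user_id").agg(n_ips=("ip", "nunique"),
                                      spam=("is_spammer", "first"))
for cls, grp in uips.groupby("spam"):
    x, y = ecdf(grp["n_ips"])
    print("spammers" if cls else "non-spammers", list(zip(x, y)))
```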

Figure 11. Segment of the user-cookie network. Cookies (black), spammers (red), non-spammers (blue)

Realizing that cookies are better identifiers of real users than user ids or IP addresses, at least from the perspective of browsers, we construct a large network of users and cookies to visualize network patterns. Figure 11 shows a small segment of the entire network. Cookies are colored in black, spammers in red, and non-spammers in blue. There are many dense clusters in which the majority of the nodes are spammers, reflecting the collusion among spammers. We believe that the group spammer detection technique of (Mukherjee et al., 2012) is applicable here to collectively identify more spammers that a spam filter may fail to detect.
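One way to surface such dense clusters is to build the user-cookie bipartite graph and inspect its connected components; this sketch uses networkx on hypothetical edges and is only a starting point, not the group-detection method of (Mukherjee et al., 2012):

```python
import networkx as nx

# Hypothetical (user, cookie) pairs observed when reviews were posted.
edges = [("u1", "ck1"), ("u2", "ck1"), ("u3", "ck1"),
         ("u4", "ck2"), ("u5", "ck3")]

G = nx.Graph()
for user, cookie in edges:
    G.add_node(user, kind="user")
    G.add_node(cookie, kind="cookie")
    G.add_edge(user, cookie)

# Components where many users share few cookies are candidate collusion groups.
for comp in nx.connected_components(G):
    users = [n for n in comp if G.nodes[n]["kind"] == "user"]
    cookies = [n for n in comp if G.nodes[n]["kind"] == "cookie"]
    if len(users) >= 3 and len(cookies) <= 2:
        print("suspicious cluster:", sorted(users), "sharing", sorted(cookies))
```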

3.3.6 Rating Patterns

Overall review rating: Besides the fake review filter, Dianping also ranks reviews by their quality, so low-quality and deceptive reviews are ranked lower than authentic reviews to minimize the damage caused by fake reviews. However, another important aspect of opinion spam is the review rating. Like other review sites, Dianping uses a scale of 1 (worst) to 5 (best) stars to rate restaurants, and each restaurant has a dynamic average rating score that changes over time. Since fake reviews are created deliberately to promote or demote businesses, the profits of businesses are threatened by them gradually and invisibly. Thus, we also need to study the rating behaviors of spammers and take action to detect such unfair practices as soon as possible. In Figure 12 we plot the rating distribution of fake reviews and genuine/truthful reviews. Genuine reviews are usually more conservative in their ratings, as the highest rating they give is usually not 5 stars but 4 stars. By comparison, spammers, who exhibit extremely unbalanced sentiment, write more 5-star reviews than others.

Figure 12. Rating distribution of fake and truthful reviews

Singleton review rating: From Figure 1(b) we know that most reviewers write only one review; such reviews are called singleton reviews (Xie et al., 2012). By splitting fake reviews into fake singleton reviews and fake non-singleton reviews, we can drill even deeper into the rating behaviors of spammers.

Figure 13 reveals a major shift in the rating distribution of fake reviews compared to Figure 12. Not surprisingly, the proportion of 5-star ratings among fake singleton reviews is even higher. We also see that the proportion of 1-star fake reviews increases by a large margin.

Figure 13. Rating of user fake singleton reviews v.s. other fake reviews

Spam rating deviation: The last thing we want to show is how the average rating of fake reviews deviates from that of genuine reviews for each restaurant. The average rating deviation of a particular restaurant is calculated by subtracting the average rating of its genuine reviews from that of its fake reviews. We then show the histogram of the absolute value of the average rating deviation across restaurants in Figure 14. In this chart, positive and negative deviations are combined to illustrate the significance of the rating disparity.
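The per-restaurant deviation could be computed as below; a minimal pandas sketch, assuming the rest_id, rating, and label columns of Table I:

```python
import pandas as pd

# Hypothetical reviews with the rest_id / rating / label columns of Table I.
reviews = pd.DataFrame({
    "rest_id": ["r1", "r1", "r1", "r2", "r2"],
    "rating":  [5,    3,    4,    5,    2],
    "label":   ["spam", "non-spam", "non-spam", "spam", "non-spam"],
})

# Mean rating per restaurant for fake and genuine reviews, side by side.
means = reviews.pivot_table(index="rest_id", columns="label",
                            values="rating", aggfunc="mean")

# Rating deviation: fake minus genuine; Figure 14 histograms its absolute value.
deviation = means["spam"] - means["non-spam"]
print(deviation.abs())  # r1 -> 1.5, r2 -> 3.0
```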

Figure 14. Histogram of absolute average rating deviation of fake reviews from genuine reviews for restaurants

We find that, for 56.6% of the restaurants, there is at least a 0.5-star offset between fake reviews and genuine ones. And astonishingly, for 31.9% of the restaurants, the fake reviews and genuine reviews differ by a full star. So identifying fake reviews is important and urgent, because a difference of 1 star can greatly affect a restaurant's business.

3.4 Opinion Spam Detection

In this section, we test the efficacy of the patterns discovered in Section 3.3 using a supervised learning approach to classify spammers and non-spammers (users). We first enumerate the proposed features and then compare their performance against other linguistic and behavioral features from related works on our large-scale real-life dataset. In light of the above analyses, we now propose a set of user-level features that are strong indicators of opinion spammers. These new features are listed in Table II.

TABLE II. Proposed Features

- regmainsite: Boolean variable indicating whether the user registered on the main site of Dianping (Figure 2)
- regtu2tr: Boolean variable indicating whether the user registered between Tuesday and Thursday (Figure 5)
- regdist2sh: distance from the city where the user registered to Shanghai (Figure 7(a))
- ATS: Average Travel Speed (Figure 9)
- weekendpcnt: % of reviews written on weekends (Figure 4)
- pcpcnt: % of reviews posted through a PC (Figure 3)
- avgdist2sh: average distance from the city where the user posts each review to Shanghai (Figure 7(b))
- AARD: average absolute rating deviation of the user's reviews (Figure 14)
- uips: # of unique IPs used by the user (Figure 10(a))
- ucookies: # of unique cookies used by the user (Figure 10(b))
- ucities: # of unique cities among the IPs the user used when writing reviews (Figure 10(c))

Each feature above is followed by its description and a reference to the figure from which it derives. Some of the features/measures shown in the figures are not at the user level, so we define the features with respect to users; please refer to the feature descriptions for details.

The spammer class is very skewed, so we under-sample a subset of the non-spammers and combine it with the spammers to form a balanced set of users, following the existing work in (Mukherjee et al., 2013; Ott et al., 2011). We then randomly shuffle the set of users and create 10 equal-sized bins for 10-fold cross validation. The experimental results reported below are from 10-fold cross validation.
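A sketch of this evaluation protocol with scikit-learn: under-sample the majority class, then run 10-fold cross validation with a linear SVM. The feature matrix here is random stand-in data, not the real features of Table II:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in data: 200 spammers (y=1) and 2000 non-spammers (y=0), with 11
# features mirroring the size of Table II's feature set.
X_pos, X_neg = rng.normal(1, 1, (200, 11)), rng.normal(0, 1, (2000, 11))

# Under-sample non-spammers to match the number of spammers.
idx = rng.choice(len(X_neg), size=len(X_pos), replace=False)
X = np.vstack([X_pos, X_neg[idx]])
y = np.array([1] * len(X_pos) + [0] * len(X_pos))

scores = cross_validate(LinearSVC(), X, y, cv=10,
                        scoring=("accuracy", "precision", "recall", "f1"))
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```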

TABLE III. Results based on 10-fold cross validation. Methods compared: Unigram and Bigram (Ott et al., 2011); Behavioral Features (Mukherjee et al., 2013); Proposed New Features; Combined. Metrics: Accuracy, Precision, Recall, F1.

The compared state-of-the-art baselines are Support Vector Machines (SVM) with linguistic features (Ott et al., 2011) and with behavioral features (Mukherjee et al., 2013). We also use SVM, under the same settings, as the supervised learning model for the proposed features. Since our problem is a binary classification task, we evaluate the proposed features using four metrics: Accuracy, Precision, Recall, and F1 score. All metrics except accuracy are computed on the positive class (spammer), as review hosting companies are interested in identifying spammer accounts. Table III shows the performance of the baselines, our proposed features, and the combination of all features. From the results we can see that the proposed new features markedly outperform the state-of-the-art baselines on all four metrics, because they capture more subtle characteristics of opinion spammers. The combination of all features achieves even better results. The performance of the baseline methods is generally consistent with that reported in (Mukherjee et al., 2013) on the Yelp dataset.

CHAPTER 4

DETECTING OPINION SPAMMERS USING POSITIVE AND UNLABELED LEARNING

(This chapter includes and expands on my paper previously published as Huayi Li, Zhiyuan Chen, Bing Liu, Xiaokai Wei, and Jidong Shao. Spotting fake reviews via collective positive-unlabeled learning. In ICDM 2014.)

4.1 Introduction

Opinions in reviews are increasingly used by consumers to make decisions and share ideas. Unfortunately, imposters now game the system by posting fake reviews to promote or demote target businesses (Jindal and Liu, 2008). Detecting fake opinions is important for ensuring the quality of reviews on review platforms. Among the most popular review hosting sites, Dianping is the largest one for Chinese reviews. It was founded in April 2003 in Shanghai, China, and now has more than 100 million monthly active users, over 33 million reviews, and more than 10 million local businesses. To improve the quality of their reviews, they developed a system to identify spamming. Their review filtering system has proven to be highly precise: when the filter identifies a review as spam, it is almost certainly fake. We can trust the high precision for two reasons. First, Dianping's system has been evolving as spammers change their behaviors, and they have a team of experts whose duty is to evaluate their commercial filter.

Second, each detected fake review is marked as deleted and a message is sent to the reviewer; the low rate of complaints from user feedback regarding incorrect identifications indicates that the high precision of the system is very reliable. However, the true recall of their system is unknown, as experienced spammers can still compose deceptive reviews that are very hard to identify. This necessitates a model that can learn from positive and unlabeled examples. Although PU learning has been shown to be effective in a wide range of text mining applications (Liu et al., 2003; Nigam et al., 2000; Liu et al., 2002; Dempster et al., 1977), little attention has been paid to the task of fake review detection. This is partially due to the fact that positively labeled data is hard to obtain for this task, and thus most supervised or semi-supervised learning approaches use pseudo fake reviews or ad-hoc fake reviews rather than real ones. With Dianping's gold-standard review dataset, we are able to deeply investigate the underlying mechanism of opinion spamming and perform supervised learning on the binary classification task.

Each review in Dianping consists of its review ID, the date it was created, the ID of the business it was written for, and the reviewer's ID and IP address, as well as a label indicating whether the review passed Dianping's fake review detection system. Inspired by the work of researchers who proposed graph-based models (Wang et al., 2011; Akoglu et al., 2013; Pandit et al., 2007), we believe that exploiting the subtlety of the correlations between users, reviews, and IP addresses can achieve better prediction results. Thus, we first propose a collective classification algorithm to identify fake reviews on our defined heterogeneous network over users, reviews, and IP addresses, and then extend it to a collective PU learning model.

The positive reviews discovered in the unlabeled set can be used to re-evaluate the spamicity scores of users and IPs, which in turn further improves the predictions on reviews. Our experiments show that PU learning significantly outperforms supervised learning for both relational and non-relational classifiers. Most importantly, PU learning extracts a large number of hidden fake reviews that had not been detected by Dianping's system. This shows the effectiveness of PU learning, whose advantage is finding hidden positive instances among the unlabeled instances even without negative training data. We summarize the contributions of this chapter as three-fold:

1. Most existing work on spam detection is in English; this is the first reported work on Chinese reviews. Nevertheless, our proposed models are not tied to Chinese, as they are language independent.

2. Because the unknown class may include both positive and negative instances, traditional supervised learning is not entirely suitable. We thus treat the unknown class as an unlabeled set and use PU learning to handle this case. To the best of our knowledge, this approach has rarely been used in fake review detection, and it is in fact more appropriate for the problem.

3. Our work is based on over 9,600 reviews with high-precision positive labels from a commercial review hosting website. We propose a collective classification framework to solve this PU learning problem and compare our work with the state of the art. Experimental results demonstrate that our proposed models are more effective than the strong baselines.

Figure 15. Sample network of the users, reviews and IP addresses

4.2 Problem Definition

In this section, we first introduce a few related notations and concepts. Then, we formally define the collective positive and unlabeled learning problem on a heterogeneous network.

A heterogeneous network is defined as G = (V, T, E), where V = {v_1, v_2, ..., v_{|V|}} is a set of nodes representing entities of different types and E is the set of edges incident on V; e_{i,j} ∈ E if v_i has a link to v_j. In our specific problem, we define the heterogeneous network as an undirected graph, so that if e_{i,j} ∈ E then e_{j,i} ∈ E. T = {t_1, t_2, ..., t_{|V|}} is the set of corresponding entity types of the nodes; each t_i is an element of a finite set of types Γ, i.e., t_i ∈ Γ. In our defined heterogeneous network there are three types of nodes: users, reviews, and IP addresses (IPs for short). That is, Γ = {User, Review, IP}. The edges between nodes represent their dependency relationships. We did not consider restaurants as part of the network because their relations with the other types of nodes are not very strong, as we will discuss later.

will discuss later. Figure 15 schematically shows the three types of nodes and some edges between them. Note that the relations between users and IPs are many-to-many, but each review is linked to exactly one user and exactly one IP address, since a review can only belong to one user and carries only one IP. Each node v_i is associated with a class label y_i, which is an element of the set of states S_{t_i} that the node can be in with respect to its entity type t_i. Thus we have y_i ∈ S_{t_i}. In our review dataset, only positive labels for some of the reviews are available; we therefore define the states of the nodes in Table IV. A review has three states: Fake (positive class), Truthful (negative class), and a special state called Unlabeled. A user has two states, Spammer and Non-spammer, and an IP can be either Suspicious or Organic.

TABLE IV. States of different entities (+, -, u are abbreviations for the states)

Node Type t_i    Node States S_{t_i}
Review           Fake (+), Truthful (-), Unlabeled (u)
User             Spammer (+), Non-spammer (-)
IP               Suspicious (+), Organic (-)

It is not necessarily true that all reviews written by spammers or posted from suspicious IP addresses are deceptive, because spammers have mixed behavior and IPs are usually shared by a large number of people. The states of users/IPs stand for the most probable state those users/IPs are expected to be in given their probabilistic estimates.

We use i as the index of node v_i as well as of its other properties. Apart from its class label y_i, each data instance v_i has an observed feature vector x_i. We define the feature space for all nodes as X and the class label space as Y. As we solve the problem using a supervised learning approach, we split our dataset into two parts, training and testing. Let A be the set of indices of nodes in the training set and D be those in the testing set. Our task is to predict the class labels of the reviews in D, {y_i | i ∈ D, y_i ∈ {+, -}}, given the input heterogeneous network G with only positive labels {y_i | i ∈ A, y_i ∈ {+, u}}. In summary, we list the notations and definitions in Table V.

TABLE V. Important Notations

Symbol      Definition
G           the heterogeneous network
V           the set of nodes in the network
T           the corresponding types of the nodes
E           the edges incident on the nodes
Γ           the set of entity names
v_i         the i-th node in the network
t_i         entity type of node v_i, s.t. t_i ∈ Γ
S_{t_i}     states that node v_i can be in given its node type t_i
x_i         observed feature vector for node v_i
X           observed feature matrix for all the nodes
y_i         class label for node v_i, s.t. y_i ∈ S_{t_i}
Y           class label assignments for all the nodes
A           the set of indices of review nodes in the training set
D           the set of indices of review nodes in the testing set

4.3 Collective Classification Models on Heterogeneous Networks

In this section, we first introduce the classic Iterative Classification Algorithm (ICA) (Sen et al., 2008) and its applications. Then, we describe our modified algorithm, which is more appropriate for the collective classification problem on heterogeneous networks.

4.3.1 Collective Classification on Reviews

A conventional classifier is a function f that maps the observed feature matrix X onto the class label space Y. In our setting, unigrams and bigrams are the features of reviews, and we normalize these text features by converting them into TF-IDF values. Feature vectors are usually assumed to be independent of each other; ICA, however, breaks this independence assumption. It is a commonly used approximate inference algorithm whose premise is simple. Consider a node v_i whose class label y_i ∈ Y needs to be determined, and suppose we have a neighborhood function N_i (Equation 4.1) that returns the indices of its neighbors. M_i is the class label vector of the neighbors of node i, derived from N_i and the estimated class label matrix Ŷ (Equation 4.2). ICA trains a local classifier f that takes in the observed features x_i of v_i as well as the estimated labels M_i of its neighbors (Equation 4.3). Since many node labels are unknown, this process is executed iteratively, and in each iteration the label y_i of each node is assigned the current best estimate from the local classifier f. In our experiments, we use logistic regression as the local classifier f.

N_i = { j | e_{i,j} ∈ E }    (4.1)

M_i = { ŷ_j | j ∈ N_i, ŷ_j ∈ Ŷ }    (4.2)

y_i = f(x_i, M_i)    (4.3)

However, the ICA algorithm is not directly applicable here because our network is in fact a multipartite graph in which nodes of the same type are not directly connected. In our case, reviews are the type of nodes we aim to classify, and reviews are not neighbors of each other. This calls for a model that can establish connections between reviews so that we can take advantage of ICA. There is a strong intuition that reviews from the same user or IP address tend to have similar class labels. Because spammers are paid per review, they write as many spam reviews as possible to maximize their profits. Sloppy spammers may use the same user account to write multiple reviews, and more experienced spammers own multiple accounts. In both cases, we assume spammers do not change user accounts or IP addresses very often, so reviews sharing the same user or IP tend to have similar labels. Reviews of the same restaurant, on the other hand, may not have a direct impact on each other, which is why we exclude restaurants from the network. In order to encode the correlations between reviews via users and IPs, we reconstruct the neighborhood function as follows:

N_i = { j | ∃k : e_{i,k} ∈ E, e_{j,k} ∈ E, t_i = t_j = Review }    (4.4)
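To make Equation 4.4 and the ICA loop concrete, the following is a minimal sketch, not the implementation used in our experiments: it assumes reviews arrive as (review_id, user_id, ip) tuples with integer review ids 0..n-1 (an illustrative schema), builds review-review neighborhoods through shared users and IPs, and summarizes M_i as the mean of the neighbor labels.

import numpy as np
from collections import defaultdict
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def review_neighbors(reviews):
    # Equation 4.4: two reviews are neighbors if they share a user or an IP.
    by_key, nbrs = defaultdict(set), defaultdict(set)
    for rid, user, ip in reviews:
        by_key[("user", user)].add(rid)
        by_key[("ip", ip)].add(rid)
    for rids in by_key.values():
        for a, b in combinations(rids, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

def ica(X, y, train_idx, test_idx, nbrs, iters=10):
    # X: dense TF-IDF matrix; y: labels (1 fake / 0 truthful), trusted only on train_idx.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    y_hat = y.copy()
    y_hat[test_idx] = clf.predict(X[test_idx])          # bootstrap with text features only
    for _ in range(iters):
        m = np.array([y_hat[list(nbrs[i])].mean() if nbrs[i] else 0.5
                      for i in range(X.shape[0])])
        Xr = np.hstack([X, m[:, None]])                 # text features + relational feature M_i
        clf = LogisticRegression(max_iter=1000).fit(Xr[train_idx], y[train_idx])
        y_hat[test_idx] = clf.predict(Xr[test_idx])     # re-estimate the test labels
    return y_hat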

Algorithm 1 HCC on review nodes

Input: G = {V, T, E}, A, D, X, Y_A = {y_i | i ∈ A, y_i ∈ {+, -}}
Output: Y_D = {y_i | i ∈ D, y_i ∈ {+, -}}

N = Initialize(G, A, D, X, Y_A)
while Ŷ not stabilized or maximal iterations have elapsed do   // iterative steps
    M_A = ∅
    Ŷ = Predict(N, M_A, A, D, X, Y_A)
end while
Output Y_D = {ŷ_i | i ∈ D, ŷ_i ∈ Ŷ}

procedure Initialize(G, A, D, X, Y_A)
    Build N_i from G for i ∈ A ∪ D using Equation 4.4
    N = {N_i | i ∈ A ∪ D}
    X_A = {x_i | i ∈ A}   // construct feature matrices X_A, X_D from training data A
    X_D = {x_i | i ∈ D}   // and testing data D respectively
    Train a classifier f from features X_A and labels Y_A
    Ŷ = Y_A   // bootstrapping
    for i ∈ D do
        ŷ_i = f(x_i)
        Ŷ = Ŷ ∪ {ŷ_i}
    end for
    return N
end procedure

procedure Predict(N, M_A, A, D, X, Y_A)
    for i ∈ A ∪ D do
        M_i = {ŷ_j | j ∈ N_i, ŷ_j ∈ Ŷ}
        M_A = M_A ∪ {M_i}
    end for
    Train a new classifier f from the training data based on X_A, Y_A and the neighbor assignments M_A
    Ŷ = Y_A   // update Ŷ
    for i ∈ D do   // estimate the label ŷ_i for reviews in the testing set
        ŷ_i = f(x_i, M_i)
        Ŷ = Ŷ ∪ {ŷ_i}
    end for
    return Ŷ
end procedure

If we simply treat the unlabeled set as negative data, the basic solution for collective classification on reviews is as shown in Algorithm 1. This idea is in fact similar to the HCC (Heterogeneous Collective Classification) model proposed in (Kong et al., 2012) when the length of the meta-path is set to two. We also found that longer meta-paths do not achieve better results, as the correlations become weak.

4.3.2 Multi-typed Heterogeneous Collective Classification

The aforementioned approach only utilizes the features of all reviews and the labels of the reviews in the training set; the valuable information of users and IPs is neglected. As Mukherjee et al. (Mukherjee et al., 2013) point out, behavioral features are strongly indicative clues for opinion spam, so we construct behavioral features for users and IP addresses to augment the collective learning on reviews. There are two challenges in collective classification on user and IP nodes. On one hand, there are no labels for users or IPs. Unlike reviews, each of which is either fake or truthful, users and IPs exhibit mixed behaviors, making them extremely difficult and expensive to label and evaluate; as a consequence, the classification task becomes very hard. On the other hand, behaviors computed from the reviews of only 500 restaurants are not strong enough to reveal spamming patterns. To tackle these two problems, we first obtain the reviews of all Shanghai restaurants written by these users or from these IPs over a two-year period in order to construct reliable behavioral features. Then we initialize the labels of users and IPs with the majority class label of their related review nodes. As the labels of reviews are readily available, we can easily compute such ad-hoc labels for users and IPs, as sketched below.
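A minimal sketch of this initialization, assuming review labels are given as a dict and reviews as (review_id, user_id, ip) tuples (an illustrative schema, not Dianping's actual one):

from collections import defaultdict

def majority_labels(reviews, review_labels):
    # Initialize each user's and IP's ad-hoc label with the majority class of
    # its reviews (1: fake/positive, 0: truthful or unlabeled-as-negative).
    votes = defaultdict(list)
    for rid, user, ip in reviews:
        votes[("user", user)].append(review_labels[rid])
        votes[("ip", ip)].append(review_labels[rid])
    # majority vote; ties fall back to the negative (non-spam) class
    return {key: int(sum(v) * 2 > len(v)) for key, v in votes.items()}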

We hereby enumerate the behavioral features for users/IPs, some of which come from (Mukherjee et al., 2013); a sketch of how they can be computed follows the list.

1. Maximum number of reviews per day: Normal users rarely write many reviews on a single day, so an unusually high daily count is suspicious. The same applies to IP addresses.

2. Total number of reviews: The number of all reviews from a user/IP.

3. Number of active days: The number of days on which at least one review is submitted from the user/IP.

4. Average number of reviews per active day: The total number of reviews divided by the number of days on which a user/IP is active. This is an important feature in that someone who writes too many reviews per day is clearly suspicious.

5. Percentage of positive reviews: The rating of a review ranges from 1 to 5, and a review rated 4 or 5 is generally considered positive. Spammers are known to deliberately promote or demote some restaurants, so this is an indicative feature for spammers/suspicious IPs.

6. Average rating: The mean of the ratings of all reviews from the user/IP.

7. Rating deviation: Rating deviation is an important metric for evaluating spammer activity. Since spamming misleads readers through ratings, the ratings of spammers are likely to deviate from those of ordinary users.

8. Average length of reviews: The length of a review is the sum of the sizes of the word tokens of all sentences in the review. In the experiment section, our statistics in Table VI show that fake reviews are on average shorter than truthful reviews.

9. Maximum content similarity: We use cosine similarity as the metric of content similarity between a pair of reviews. The maximum content similarity is the largest value among all pairs of reviews of a user/IP. Fake reviewers may simply modify a past review or make only small changes to it; someone who copies his own reviews is definitely suspicious.
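A sketch of these computations for a single user or IP, assuming the account's reviews sit in a pandas DataFrame with columns date, rating and text (a hypothetical schema); the rating-deviation feature is simplified here to the standard deviation of the account's own ratings:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def behavior_features(df):
    # df: all reviews of one user/IP with columns ['date', 'rating', 'text'].
    per_day = df.groupby("date").size()
    feats = {
        "max_reviews_per_day": per_day.max(),
        "total_reviews": len(df),
        "active_days": per_day.size,
        "avg_reviews_per_active_day": len(df) / per_day.size,
        "pct_positive": (df["rating"] >= 4).mean(),
        "avg_rating": df["rating"].mean(),
        "rating_std": df["rating"].std(ddof=0),   # simplified rating deviation
        # average token count (assumes text is already word-segmented)
        "avg_length": df["text"].str.split().str.len().mean(),
    }
    if len(df) > 1:
        # maximum pairwise content similarity over the account's own reviews
        tfidf = TfidfVectorizer().fit_transform(df["text"])
        sim = cosine_similarity(tfidf)
        np.fill_diagonal(sim, 0.0)
        feats["max_content_similarity"] = sim.max()
    else:
        feats["max_content_similarity"] = 0.0
    return feats

We modify Algorithm 1 to incorporate the observed features from user and IP nodes. For simplicity, we use the superscripts 1, 2, 3 to denote reviews, users and IPs respectively. For example, X^1 is the feature matrix for reviews, and X^3 is the matrix of behavioral features for IPs. We present the Multi-typed Heterogeneous Collective Classification (MHCC) model in Algorithm 2. Note that since the new model utilizes information from neighboring entities of all types, the neighborhood function here is computed from the union of Equation 4.1 and Equation 4.4. Both the initialization and the iterative prediction steps differ from HCC: data instances in MHCC have richer relational features provided by the estimates of the classifiers in the previous step.

4.3.3 Collective Positive-Unlabeled Learning Model

In this subsection, we show how to augment the collective classification model with the Positive-Unlabeled learning framework to improve its performance. There are two main drawbacks of the MHCC model: 1) there are potentially deceptive reviews hiding among the unlabeled reviews that Dianping's algorithm did not capture.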

Algorithm 2 MHCC on all types of nodes

Input: G = {V, T, E}, A, D, X^1, X^2, X^3, Y^1_A = {y_i | i ∈ A, y_i ∈ {+, -}}, Y^2, Y^3
Output: Y^1_D = {y_i | i ∈ D, y_i ∈ {+, -}}
// superscripts are entity types (1: review, 2: user, 3: IP)

N, Ŷ = Initialize(G, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
while Ŷ not stabilized or maximal iterations have elapsed do
    M = ∅
    Ŷ = Predict(N, M, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
end while
Output Y_D = {ŷ_i | i ∈ D, ŷ_i ∈ Ŷ}

procedure Initialize(G, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
    Build N_i from G for i ∈ {1, 2, ..., |V|} using Equation 4.1 and Equation 4.4
    N = {N_i | i ∈ {1, 2, ..., |V|}}
    X^1_A = {x_i | i ∈ A}   // construct feature matrices X^1_A, X^1_D from training data A
    X^1_D = {x_i | i ∈ D}   // and testing data D respectively
    Train a review classifier f^1 from X^1_A and Y^1_A
    Ŷ^1 = Y^1_A, Ŷ^2 = Y^2, Ŷ^3 = Y^3   // bootstrapping
    for i ∈ D do   // estimate the label ŷ_i for reviews in the testing set
        ŷ_i = f^1(x_i)
        Ŷ^1 = Ŷ^1 ∪ {ŷ_i}
    end for
    return N, Ŷ
end procedure

procedure Predict(N, M, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
    for i ∈ {1, 2, ..., |V|} do
        M_i = {ŷ_j | j ∈ N_i, ŷ_j ∈ Ŷ^1 ∪ Ŷ^2 ∪ Ŷ^3}
        M = M ∪ {M_i}
    end for
    Train f^1 from X^1_A, Y^1_A and M
    Train f^2 from X^2, Y^2 and M
    Train f^3 from X^3, Y^3 and M
    Ŷ^1 = Y^1_A, Ŷ^2 = ∅, Ŷ^3 = ∅
    for i ∈ {1, 2, ..., |V|} do
        k = t_i   // k is the type of node i
        ŷ_i = f^k(x_i, M_i)
        Ŷ^k = Ŷ^k ∪ {ŷ_i}
    end for
    return Ŷ
end procedure

MHCC simply treats unlabeled examples as negative, so if a large number of fake reviews remain in the unlabeled set, MHCC is trained on wrongly labeled data; 2) the ad-hoc labels of users and IPs are not very accurate, as they are computed from the labels of neighboring reviews. Since negatively labeled neighboring reviews may actually be positive, the labels of users and IPs should be updated as more reviews are correctly classified. To address these two issues, we change the iterative steps of MHCC to allow the training labels to be modified dynamically by the model. New labels are assigned to the confident positives and negatives of all entity types. We call the new model the Collective Positive-Unlabeled learning (CPU) model, illustrated in Algorithm 3. The initialization procedure is the same as in Algorithm 2; the differences are in the iterative steps. The first difference between MHCC and CPU is that CPU allows initial labels to be overridden when the current probability estimate strongly indicates the opposite prediction. This is especially useful for mining fake reviews that are identified by the collective classifier but passed Dianping's filter. The second difference is that CPU also uses testing data for training when their labels are reliable. This is impossible for a non-relational classifier, but here the testing data interact with the training data in the network; it is also not applicable to conventional relational classifiers, since the true labels of test instances are unknown.

4.4 Experiment

We now evaluate the proposed Collective Positive-Unlabeled learning model and compare it with several other baselines using real-life restaurant reviews from Dianping.com.

Algorithm 3 CPU on all types of nodes

Input: G = {V, T, E}, A, D, X^1, X^2, X^3, Y^1_A = {y_i | i ∈ A, y_i ∈ {+, u}}, Y^2, Y^3, σ
Output: Y^1_D = {y_i | i ∈ D, y_i ∈ {+, -}}
// superscripts are entity types (1: review, 2: user, 3: IP)

N, Ŷ = Initialize(G, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)   // initialization same as Algorithm 2
while Ŷ not stabilized or maximal iterations have elapsed do   // iterative steps
    M = ∅
    Ŷ = Predict(N, M, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
end while
Output Y_D = {ŷ_i | i ∈ D, ŷ_i ∈ Ŷ}

procedure Predict(N, M, A, D, X^1, X^2, X^3, Y^1_A, Y^2, Y^3)
    for i ∈ {1, 2, ..., |V|} do
        M_i = {ŷ_j | j ∈ N_i, ŷ_j ∈ Ŷ^1 ∪ Ŷ^2 ∪ Ŷ^3}
        M = M ∪ {M_i}
    end for
    Train f^1 from X^1_A, Y^1_A and M
    Train f^2 from X^2, Y^2 and M
    Train f^3 from X^3, Y^3 and M
    Ŷ^1 = Y^1_A, Ŷ^2 = ∅, Ŷ^3 = ∅
    for i ∈ {1, 2, ..., |V|} do   // update Ŷ^1, Ŷ^2, Ŷ^3
        k = t_i   // k is the type of node i
        ŷ_i = f^k(x_i, M_i)
        Ŷ^k = Ŷ^k ∪ {ŷ_i}
    end for
    Compute the mean µ_k and standard deviation σ_k of the probability distribution Ŷ^k for all three types of nodes
    for i ∈ {1, 2, ..., |V|} do
        k = t_i   // k is the type of node i
        if ŷ_i > µ_k + σ_k then   // reliable positive data
            update Y^k with label y_i set to +
            if i ∈ D then
                move i from D to A   // hidden positives in testing reviews can be used for training
            end if
        else if ŷ_i < µ_k - σ_k then   // reliable negative data
            update Y^k with label y_i set to -
            if i ∈ D then
                move i from D to A   // reliable negatives in testing reviews can be used for training
            end if
        end if
    end for
    return Ŷ
end procedure
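The µ_k ± σ_k selection step is the heart of CPU. A minimal sketch of this update for one node type, assuming a 1/0/-1 label encoding (positive/negative/unlabeled) that is illustrative rather than the thesis's actual implementation:

import numpy as np

def cpu_update(scores, labels, is_test):
    # One CPU iteration's label update for a single node type.
    # scores: estimated P(+) per node; labels: 1 / 0 / -1 (pos / neg / unlabeled);
    # is_test: boolean mask of test nodes. Confident test nodes join the training set.
    mu, sigma = scores.mean(), scores.std()
    pos = scores > mu + sigma                 # reliable positives
    neg = scores < mu - sigma                 # reliable negatives
    labels = labels.copy()
    labels[pos], labels[neg] = 1, 0
    is_test = is_test & ~(pos | neg)          # move reliable test nodes to training
    return labels, is_test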

4.4.1 Datasets and Language Characteristics

As mentioned in the introduction, Dianping has a spam detection system to filter fake reviews on its website, and its algorithm has been evolving over time to adapt to the changes of spammers. Our goal is to help them identify hidden fake reviews that their system is still unable to capture, and to derive new features that adapt to the changes of experienced spammers.

Our experiment dataset consists of reviews of 500 restaurants in Shanghai. Since the system is considered robust and mature in recent years, we only used the reviews of those restaurants between November 1st, 2011 and November 28th. Table VI shows the statistics of our 500-restaurant dataset.

TABLE VI. Statistics of the 500 restaurants in Shanghai (columns: Fake reviews, Unlabeled reviews, Total; rows: No. of reviews, No. of unique users, No. of unique IPs, No. of reviews per user, No. of reviews per IP, Avg. no. of Chinese characters, Avg. no. of Chinese words)

It is interesting to see that among the fake reviews, the

number of reviews per IP is almost twice that of the unlabeled ones, which indicates that IPs associated with spammers are more active than IPs of organic users. This may explain why, by using network effects, a relational classifier can improve on a traditional local classifier. Another interesting finding is that fake reviews identified by Dianping's algorithm are on average a little shorter than unlabeled reviews. This may be because non-spammers compose truthful reviews based on their real experience while, in contrast, it is harder for spammers to make up long stories without real experience. Compared with English reviews, reviews in Chinese are inherently much shorter: reviews in our dataset contain 85.5 words on average, while reviews in Yelp's data challenge have a much greater average length. Furthermore, unlike English words, which are delimited by white spaces or punctuation, Chinese words in a sentence are not separated by spaces, which makes them more ambiguous. In order to split Chinese text into a sequence of words, we use a word segmentation tool called Jieba. Jieba is implemented based on a Hidden Markov Model (Sun et al., 2009), and the task of Chinese word segmentation is quite similar to shallow parsing or chunking.
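A minimal sketch of this preprocessing step, assuming the open-source jieba package and scikit-learn (the toy review strings are illustrative):

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["这家餐厅的小笼包非常好吃", "服务态度很差，不会再来了"]  # toy examples

# Segment each review into words with jieba, then build unigram+bigram
# TF-IDF features as described above.
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)
print(X.shape)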

4.4.2 Experiment Settings

Training and Testing

Since our dataset contains two classes, positive and unlabeled, for non-PU learning models we treat unlabeled reviews as negative data in both the training and testing phases. However, for our proposed Collective Positive-Unlabeled learning model and the other PU learning baselines, we treat unlabeled data as unknown in training, but in testing we assume unlabeled instances are negative. Our results are obtained through 10-fold cross-validation with the data instances randomly shuffled.

Evaluation Metrics

The effectiveness of the proposed model was evaluated by means of the standard Accuracy (A), Precision (P), Recall (R) and F1-score (F1) for the binary classification task. The accuracy measure is simply the percentage of correctly identified instances (both fake and non-fake). All the other metrics (precision, recall and F1) are based on the positive class, because the review hosting company is mainly interested in identifying deceptive reviews in order to improve the quality of reviews on its website.

Compared Methods

We now list the baselines from related work and compare them with our proposed models.

Logistic Regression (LR): Logistic regression is a traditional discriminative model on flat data. We use logistic regression because it efficiently produces a probability estimate of each review being in the positive class (in our case, the fake review class). It also serves as the base learner for the other relational classifiers discussed later, including our proposed models. Mukherjee et al. (Mukherjee et al., 2013) applied Support Vector Machines (SVM) to Yelp's reviews. However, SVM does not naturally produce a probability output. Although there has been some research on generating probability

estimates for SVM (Wu et al., 2004), it is in fact much more computationally expensive than LR. As the base learner is called iteratively multiple times in the relational classifiers, we choose LR rather than SVM. We also conducted experiments using LR and SVM on the same set of text features, and both classifiers yield similar results.

PU LEArning (PU-LEA): To the best of our knowledge, this is the first PU learning algorithm that has been employed for spam detection. Hernández et al. (Hernández et al., 2013) proposed the PU-LEA algorithm, which iteratively removes positive data instances from the unlabeled ones. Their algorithm treats the unlabeled reviews as negative data; the trained classifier gradually uncovers hidden positives in the unlabeled set, which in turn helps train a new classifier with updated labels. Their work is our baseline in the PU learning setting, although it does not exploit network relations.

Spy-EM and Spy-SVM (Li et al., 2014b): Liu et al. (Liu et al., 2003) built a series of classifiers for Learning from Positive and Unlabeled data (LPU). Their model follows a two-step approach. In the first step, they identify a collection of reliable negative instances from the unlabeled set using a Spy technique (Liu et al., 2002), sketched below. In the second step, they iteratively apply a classification algorithm and select a good classifier from the resulting set. The second step has two variations, Expectation Maximization (EM) (Dempster et al., 1977) and Support Vector Machines (SVM). EM first builds a Naïve Bayes classifier (Nigam et al., 2000) with the positive examples and only the reliable negative examples; it then alternates the E-step and M-step to obtain a final classifier after convergence. SVM works similarly; see (Yu et al., 2002) for details.
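A minimal sketch of the first (Spy) step, as we understand it from (Liu et al., 2002): plant a small sample of positives as spies in the unlabeled set, train a Naïve Bayes scorer, and keep the unlabeled documents scored below the least-confident spy. The 15% spy ratio and the dense-matrix inputs are illustrative assumptions.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(X_pos, X_unl, spy_ratio=0.15, seed=0):
    # Step 1 of Spy-EM: treat spies as if they were unlabeled/negative,
    # then use their scores to calibrate a reliable-negative threshold.
    rng = np.random.default_rng(seed)
    n_spy = max(1, int(spy_ratio * X_pos.shape[0]))
    spy_idx = rng.choice(X_pos.shape[0], n_spy, replace=False)
    keep = np.setdiff1d(np.arange(X_pos.shape[0]), spy_idx)

    X = np.vstack([X_pos[keep], X_pos[spy_idx], X_unl])
    y = np.r_[np.ones(len(keep)), np.zeros(n_spy + X_unl.shape[0])]
    probs = MultinomialNB().fit(X, y).predict_proba(X)[:, 1]   # P(positive)

    spy_scores = probs[len(keep):len(keep) + n_spy]
    unl_scores = probs[len(keep) + n_spy:]
    threshold = spy_scores.min()       # the least-confident spy
    return unl_scores < threshold      # mask of reliable negatives in X_unl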

The two models are made available as the LPU software. We compared these models with our Collective PU learning model to show that incorporating the relations between users, IPs and reviews improves performance.

Heterogeneous Collective Classification (HCC): Kong et al. (Kong et al., 2012) proposed a heterogeneous classifier relying on meta-paths (Sun et al., 2011). Their model captures the different types of dependencies among various types of entities. Their reported experiments were performed on the ACM Conference Dataset to predict the ACM index term of a paper through the paper-author-institute-conference network. Their model turned out to be more effective than a set of strong baselines including the Iterative Classification Algorithm (ICA) (Lu and Getoor, 2003), Combined Path Relations (CP) (Eldardiry and Neville, 2011) and Collective Ensemble Classification (CF) (Eldardiry and Neville, 2011). Here, we compare it with our proposed MHCC and CPU models.

Multi-typed Heterogeneous Collective Classification (MHCC): This is our proposed supervised learning model for heterogeneous networks. The key difference between MHCC and HCC is that HCC reduces a heterogeneous network to a homogeneous network over the target node type, encoding the complex relations into a simple relation between nodes of the same type, while MHCC retains the heterogeneous network and models richer relations across different entities. The experimental results show that MHCC captures more subtle correlations between nodes of different types.

Collective Positive-Unlabeled learning (CPU): This is the final model, which combines PU learning and MHCC. It not only encodes the mutual dependencies of the different entities but also iteratively extracts reliable positive and negative nodes from the heterogeneous network. It clearly outperforms all the other models.

Model Comparison

We evaluate the effectiveness of our proposed models on the collective classification task. 10-fold cross validation is performed for each model, and accuracy, precision, recall and F1 are reported in Figure 16.

Figure 16. Evaluation of the different models (CPU, MHCC, HCC, SpySVM, SpyEM, SPU, LR) on accuracy, precision, recall and F1, based on 10-fold cross validation. Note: although Spy-EM has the highest recall, it over-predicts the positive class and hence has the lowest precision.

First of all, all three collective classification models, which explicitly explore dependencies from various aspects, outperform the non-relational classifiers, which classify each instance independently. This finding supports the premise that collective classification classifiers

are superior, as they take into account the labels of neighboring nodes. More specifically, if a review is fake, reviews from the same user or IP tend to be suspicious. Secondly, the PU learning models have higher recall than the other supervised learners despite some loss in precision. This is because Dianping's unlabeled data certainly contain hidden positives, which PU learning can turn into purer positive and negative instances for training. Last but not least, our proposed models, MHCC and CPU, outperform the strong baseline HCC with both higher precision and higher recall. It is not surprising that MHCC beats HCC, as it incorporates more correlations from the different homogeneous networks derived from the original heterogeneous network. CPU further improves on MHCC in that more positive and negative users, reviews and IPs are correctly identified, which in turn trains more accurate classifiers. Fake reviews that are not filtered by Dianping's system provide a better assessment of their associated users and IPs, and similarly, suspicious users and IPs can disclose more hidden

fake reviews. Exploiting the complex structure of the network, MHCC and CPU effectively identify many false negatives that LR and HCC cannot find. The use of the labels of neighboring objects also largely improves precision compared to the other PU learning models and the base learner LR. Obviously, the more training data, the better the performance collective classification classifiers can give. It is therefore natural to compare the relational classifiers with the local classifiers given different sizes of training data. Figure 17 shows the F1-scores of the different models as the size of the training data varies from 10% to 90%.

Figure 17. F1-scores of all models (CPU, MHCC, HCC, SpySVM, SpyEM, SimplePU, LR) for various training sizes (percentage of training data)

The chart shows that when the training data is less than 20%, most local classifiers (both LR and the PU learning ones) do a better job than most relational ones, but when given more than 40% training data, the relational classifiers improve significantly. It is worth noting that our proposed Collective PU learning model still achieves relatively good results even when only 10% of the training data is provided. This sheds light

on the effectiveness of PU learning in the collective classification setting, as our CPU model most effectively uncovers implicit positives from the unlabeled data points and iteratively enhances itself.

Posterior Analysis

The experimental results show that our predictions are quite accurate, so we perform some further analyses on the results to gain a better understanding of the proposed models and of why they work. In addition to the probability estimates for fake reviews, our proposed models MHCC and CPU also produce estimates for the other types of nodes in the heterogeneous network, i.e., users and IPs. However, labels of users and IPs are very hard to obtain. This is because, unlike reviews, which are either fake or non-fake, users and IPs demonstrate mixed characteristics. For example, spammers can also write truthful reviews based on their personal experience. User accounts may sometimes be compromised to write

fake reviews, but that does not necessarily imply the accounts are malicious all the time. Moreover, the users under the same IP are possibly a mixed set of spammers and organic users. Thus their labels are very hard to decide. As can be seen in Table VI, there is a great distinction between the number of fake reviews per IP and the number of unlabeled reviews per IP: malicious IPs tend to be more active than normal IPs. We therefore illustrate how disparate suspicious and organic IPs are. We use 0.5 as a probability threshold to categorize IPs into suspicious (≥ 0.5) and organic (< 0.5). In Figure 18, the CDFs (Cumulative Distribution Functions) of the number of fake reviews connected to suspicious and organic IPs reveal a great difference in their connection to fake reviews: most IPs connect to fewer than 5 fake reviews, while some IPs are linked to more than 10 fake reviews.

Figure 18. CDF of the number of fake reviews for suspicious IPs vs. organic IPs

Quite similarly, users can be categorized into spammers and non-spammers by the degree to which they post deceptive reviews. We plot the CDFs of the number of spammers linked to suspicious and organic IPs in Figure 19 to show that spammers usually form groups under suspicious IPs.

Figure 19. CDF of the number of spammers for suspicious IPs vs. organic IPs

These findings support our intuition that, through the iterations of collective positive and unlabeled learning, better estimates for nodes reinforce more accurate label predictions for reviews, users and IPs. The discovered users and IPs not only help identify hidden deceptive reviews in the unlabeled dataset but also enable Dianping to take action against those users and IPs whose spam scores are extraordinarily high. In each iteration, our proposed CPU model learns better estimates of the suspicious users and IPs; correctly identified hidden positive users and IPs help retrieve more hidden positive reviews, thus giving our model much better performance.

TABLE VII. Confusion matrix. Positive is the fake review class, Negative is the unlabeled review class. TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative

                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN

TABLE VIII. Disagreement counts for a pair of models (A/B). The disagreement counts are based on all the nodes of one connected component of the entire network.

Short name (A/B)   Meaning (A/B)   HCC/CPU   MHCC/CPU
1/4                TP/FN           5         3
4/1                FN/TP
2/3                TN/FP           7         6
3/2                FP/TN

Next, we want to use a small segment of our heterogeneous network to showcase the effectiveness of our proposed model in improving both precision and recall. Recall and precision are calculated using the confusion matrix in Table VII. Note that in testing we have to treat unlabeled reviews as negative, because we have no idea which unlabeled reviews are actually fake, but this is fair across all models because they are evaluated in the same way. Beyond the actual values of precision and recall, we would like to investigate the agreement and disagreement of the different models to demonstrate the mechanism behind the scenes. For simplicity and ease of visualization, we encode the combinations of disagreement of two models A and B with their

short names a/b, listed in Table VIII. The numbers 1-4 represent TP, TN, FP and FN respectively, and the last two columns give the disagreement counts. Compared with HCC and MHCC, our proposed CPU model not only identifies more FNs as TPs but also corrects a large number of FPs. Figure 20 and Figure 21 visualize a small fragment of Dianping's heterogeneous network.

Figure 20. Disagreement of HCC and CPU on a small segment of one connected component of the entire heterogeneous network. Blue nodes are IPs, red nodes are users and black ones are reviews. Labels of IPs are sequence numbers; labels of users are left empty. Labels of reviews are the short names listed in Table VIII, otherwise they are left empty.

Figure 21. Disagreement of MHCC and CPU on a small segment of one connected component of the entire heterogeneous network. Nodes and labels are as in Figure 20.

In Figure 20, IP 26 is most likely an organic IP. It propagates its belief to IP 41 via a shared user who is also very likely to be a non-spammer. This helps the CPU model balance the text features and the relational features, and eventually correct the false positives predicted by the HCC model, in which user and IP nodes are not modeled. Figure 21 compares CPU

with MHCC. The ground truth is that IP 51 is definitely suspicious. However, the initial label estimate of IP 51 is organic, because the labels of many reviews associated with IP 51 are unknown. The strength of CPU is that it iteratively extracts positives from the unlabeled data while updating the estimates of all kinds of nodes; the classifier learned in the next iteration is provided with more accurate relational features and can make better predictions in further steps. Empowered by PU learning, the CPU model can discover more false negatives that MHCC cannot capture. This demonstrates the power of PU learning in relational classifiers.

CHAPTER 5

DETECTING CAMPAIGN PROMOTERS ON TWITTER

(This chapter includes and expands on my paper previously published as Huayi Li, Arjun Mukherjee, Bing Liu, Rachel Kornfield, and Sherry Emery. Detecting campaign promoters on Twitter using Markov random fields. In ICDM 2014.)

5.1 Introduction

As Twitter has emerged as one of the most popular platforms for users to post updates, share information, and track rapidly changing trends, it has become particularly valuable for targeted advertising and promotions. Since tweets can be posted and accessed from a wide range of web-enabled services, real-time propagation of information to a large audience has become a focus of merchants, governments and even malicious spammers, who increasingly use Twitter to market or promote their products or services. On the research front, researchers have regarded Twitter as a sensor of the real world and have conducted numerous experiments and investigations on a variety of tasks, including analyzing the mood and sentiment of people (Liu, 2012), detecting rumors (Qazvinian et al., 2011; Gupta et al., 2012), detecting Twitter spammers (Benevenuto et al., 2009; Grier et al., 2010; Thomas et al., 2011), correlating Twitter activity with the stock market (Bollen et al., 2011a; Bollen et al., 2011b; Ruiz et al., 2012), predicting presidential elections (Tumasjan et al., 2010), forecasting movie box office revenues (Asur

and Huberman, 2010), modeling social behaviors (Liu et al., 2013) and influence (Shuai et al., 2012). In this chapter, we aim to solve the problem of detecting user accounts involved in promotional campaigns; more specifically, we identify promoter accounts and non-promoter accounts on Twitter for a particular topic. Promoters often influence people in a hidden or implicit manner without disclosing their true intentions; they may even deliberately try to hide them. Readers are thus often unaware that the posts they see are misleading posts designed to make them purchase some target products or accept some ideas; they may think those posts are just ordinary posts from random members. It is thus important to identify such campaigns and promoters. This discovery is clearly useful for businesses and organizations. For example, any business would want to know whether its competitors are carrying out secret campaigns on Twitter to promote their products and services (and possibly also making negative remarks/attacks about its own products/services). It also contributes to research in growing fields like opinion spam (Jindal and Liu, 2008) and deception (Mukherjee et al., 2013). However, by no means do we say that all campaigns on Twitter are bad or are spam. For example, a government health agency may conduct an anti-smoking campaign on Twitter to inform the general public of the health risks of smoking and how to quit smoking. In this case, the agency would want to know how effective the campaign is and whether the general public is responding to the campaign, and even helping it by propagating the campaign messages and the campaign's websites or pages. In fact, our research is motivated by a real-life application and a request from a health research program that studies smoking-related

activities on Twitter. In the field of health science, more and more researchers are measuring public health through the aggregation of a large number of health-related tweets (Paul and Dredze, 2011). The campaigns studied in our work are three health-related campaigns about smoking. Nearly five decades after the first US Surgeon General's Report on Smoking and Health was released, it is estimated that about 443 thousand Americans still die each year from smoking-related diseases. It is thus critical to provide health scientists and government health agencies with clean feedback from the general public. They can then use this feedback to perform health- and policy-related studies of the activities and tweets of Twitter users, to understand the effectiveness of health campaigns, to make better decisions and to design more appropriate policies. Thus, our goal is to classify two types of user accounts: those involved in promotion and those not involved in promotion. Because Twitter only allows 140-character-long messages (called tweets), they are often too short for effective promotion of targeted products/services. Promotional tweets therefore typically include URLs pointing to the full messages, which may contain pictures, audio and video (the URLs are typically shortened too). Note that we do not study opinion spamming in this work, which refers to posting fake opinions about brands and products in order to promote or demote them; such posts often do not contain URLs. For opinion spamming, please refer to (Fei et al., 2013; Mukherjee et al., 2013). Probably the most closely related work to ours is that of (Benevenuto et al., 2009), but it is in the YouTube video domain and their video attributes are not directly applicable to our problem. This chapter formulates detecting promoters as a classification problem to identify

promoters and non-promoters. Although traditional supervised classification is an obvious approach, we argue that it is unable to fully exploit the rich context of the problem. As we will see in the experiment section, the traditional classification approach adapted to our context produces markedly poorer results than our proposed T-MRF approach. By rich context, we mean tweet content, user behavior, social network effects, and the burstiness of postings. Due to the social network effect, user accounts are not independent. In fact, we found that many promoter accounts are related to each other via following relations. They are also implicitly related through the content similarities of their tweets. Furthermore, they may be related because they post at roughly the same time, resulting in bursts of posts. Additionally, if tweets from several user accounts all include the same URLs, those accounts may also be related. Thus, the i.i.d. (independent and identically distributed) assumption on instances in traditional classification is violated. To capture these sophisticated characteristics of campaign promoters, the underlying infrastructure, and the rich context, we model this problem using Markov Random Fields (MRF). A traditional MRF uses one type of node in its graph. In our case, however, we have multiple types of nodes, which affect each other in different ways. We thus extend MRF to typed-MRF (or T-MRF). T-MRF generalizes the classic MRF: with a single type of node, T-MRF reduces to MRF. T-MRF allows us to flexibly specify propagation matrices for different types of nodes, where the type refers to the node type, e.g., user, URL or burst. We then use the Loopy Belief Propagation method (Murphy et al., 1999) to perform inference, i.e., to estimate each user node's belief (probability) of being in the promoter/non-promoter category.

Our experiments are conducted using three Twitter datasets shared by our collaborators in health science. Two datasets are about two known anti-smoking campaigns conducted by the Centers for Disease Control and Prevention (CDC), a government health agency in the USA, and one dataset is about electronic cigarette (e-cigarette) promotions on Twitter. Our algorithm can accurately classify promoters and normal Twitter users in all three datasets. From the e-cigarettes dataset, we found that there are numerous promotions going on on Twitter, mainly promoting different brands of e-cigarettes; such activities have long been suspected by health researchers. Our results thus demonstrate the effectiveness of the proposed T-MRF model, which outperforms several baselines markedly. Our analysis of the results also shows some interesting differences between the two types of campaigns.

5.2 Promoter Detection Model

This section presents the typed-MRF (T-MRF) model for detecting promoters. As discussed in the introduction, the entities in a campaign are strongly related to each other, and the standard approach of classifying each entity independently ignores these relations. We thus formulate our promoter detection problem with Markov Random Fields (MRF), which are well suited to such relational classification problems. To our knowledge, this is the first attempt to employ MRFs for solving the campaign promoter problem on Twitter. Applying the standard MRF to our problem, however, is not sufficient; we thus extend it to typed-MRF (T-MRF). Below, we first introduce the basic MRF model and its inference algorithm, and then generalize it to T-MRF in order to solve our problem in a flexible way.

5.2.1 Markov Random Fields

A Markov Random Field (also called a Markov Network) is an undirected graphical model designed to infer unobserved hidden variables given observed data. An MRF works on an undirected graph G = (V, E), where each vertex or node v_i ∈ V represents a random variable. An edge (v_i, v_j) represents a statistical correlation between the pair of variables indexed by i and j. A list of potential functions is used to measure compatibility among the nodes involved in a clique of the graph. Each random variable can be in any of a finite number of states S and is independent of the other random variables given its immediate neighbors. The inference task is to compute the maximum likelihood assignment of states to nodes; the states here are our classes in classification. A subclass of Markov Random Fields that arises in many contexts is the Pairwise Markov Random Field (pMRF). Instead of imposing potential functions on large cliques, the potential functions in a pMRF are over single variables and pairs of variables (or edges). We use ψ_i(σ_i) to denote the potential function on a single variable (indexed by node i), indicating the prior belief that the random variable v_i is in state σ_i; we also call it the prior of the node. We use ψ_{i,j}(σ_i, σ_j) to denote the potential that node i is in state σ_i and node j is in state σ_j for the edge between the pair of random variables (v_i, v_j). Each potential function is simply a table of values associated with the random variables. Due to its simplicity and efficiency, the pMRF is widely used in applications, and we choose to use it in this work. For simplicity of presentation, in the subsequent discussion, when we say MRF we mean pMRF.

5.2.2 Loopy Belief Propagation

The inference task in a Pairwise Markov Random Field is to compute the posterior probability over the states/labels of each node given the prior state assignments and the potential functions. For specific graph topologies such as chains, trees and other low tree-width graphs, there exist efficient algorithms for exact inference. For a general graph, however, exact inference is computationally intractable, so approximate inference is typically used. The most popular algorithm is Loopy Belief Propagation, which derives from Belief Propagation. Belief Propagation was first proposed by (Pearl, 1982) for finding exact marginals on trees. It turns out the same algorithm can be applied to general graphs that contain loops (Yedidia et al., 2000); the algorithm is then called Loopy Belief Propagation (LBP). LBP is not guaranteed to converge to the correct marginal probabilities, but recent studies (Murphy et al., 1999) indicate that it often converges and that the marginals are a good approximation to the correct posteriors. The key idea of LBP is iterative message passing. A message from node i to node j is based on all messages from other nodes to node i except that from node j itself. The following equation gives the formula for message passing:

m_{i→j}(σ_j) = z_1 Σ_{σ_i ∈ S} ψ_{i,j}(σ_i, σ_j) ψ_i(σ_i) Π_{k ∈ N(i)\j} m_{k→i}(σ_i)    (5.1)

where z_1 is a normalization constant and σ_j indexes one component of the message m_{i→j}(σ_j), which is proportional to the likelihood that node j is in state σ_j given the evidence from node i in all its possible states σ_i. N(i) is a function that returns all the neighbors of node i. The above equation is called the sum-product algorithm, because the inner product is over the messages from other nodes to node i and the outer summation sums over all states that node i can take. At the beginning of LBP, all messages are initialized to 1. Then the messages of each node from its neighbors are alternately updated until the messages stabilize or a maximum number of iterations is reached. The final belief b_i(σ_i) of a node i is a vector of the same dimension as the message, measuring the probability of node i being in state σ_i. The belief of node i is computed from the normalized messages from all its neighbors as shown below, where z_2 is the normalization factor that ensures Σ_{σ_i} b_i(σ_i) = 1.

b_i(σ_i) = z_2 ψ_i(σ_i) Π_{k ∈ N(i)} m_{k→i}(σ_i)    (5.2)
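A minimal dense sketch of Equations 5.1 and 5.2, dictionary-based and unoptimized; the data layout (directed edge pairs, per-node prior vectors) is an illustrative assumption:

import numpy as np

def lbp(nodes, edges, priors, psi, iters=20):
    # nodes: list of node ids; edges: set of directed pairs (i, j), both
    # directions present; priors[i]: prior vector psi_i; psi[(i, j)]: edge
    # potential matrix whose rows index states of i and columns states of j.
    nbrs = {i: [j for (a, j) in edges if a == i] for i in nodes}
    msgs = {(i, j): np.ones(len(priors[j])) for (i, j) in edges}
    for _ in range(iters):
        new = {}
        for (i, j) in edges:
            inc = [msgs[(k, i)] for k in nbrs[i] if k != j]
            incoming = np.prod(inc, axis=0) if inc else np.ones(len(priors[i]))
            m = psi[(i, j)].T @ (priors[i] * incoming)   # Equation 5.1 (sum-product)
            new[(i, j)] = m / m.sum()                    # z_1 normalization
        msgs = new
    beliefs = {}
    for i in nodes:
        inc = [msgs[(k, i)] for k in nbrs[i]]
        b = priors[i] * (np.prod(inc, axis=0) if inc else 1.0)
        beliefs[i] = b / b.sum()                         # Equation 5.2, z_2 normalization
    return beliefs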

5.2.3 T-MRF

We now extend MRF because we need to consider multiple types of nodes. Given such different node types, the interactions or dependencies among the nodes (or random variables) also differ. For example, in our problem there are clearly two main types of entities, users and URLs, which are our nodes. We also introduce bursts as another type of node. When promoters promote some URLs, they often do so in bursts due to pre-planned campaigns. That is, campaign organizers periodically drive the campaign by sending a large number of tweets, which results in a sudden increase of tweets related to a topic within a short period of time. Figure 22 shows the burstiness in one of our datasets.

Figure 22. Burstiness of the CDC 2012 campaign dataset

We define the important peaks as the third type of entity and call them bursts. The reason we use peaks or bursts is that users within the same burst may have some relationships (e.g., latent sockpuppets, deliberate or coincidental collusion among users, etc.). Different types of nodes also have different states. For example, for the three types of nodes in our case, we have:

A user is either a promoter or a non-promoter. Thus, each user node has these two possible states.

A URL is either a promoted or an organic URL. Each URL node thus has these two possible states.

A burst is either a planned or a normal burst. A burst means that some topic suddenly becomes popular.

As we indicated in the introduction, due to the relatedness of these nodes, the probability of one node being in a particular state is influenced by the state probabilities of the other associated nodes. For example, if a user has a high probability of being a promoter, then the URLs in his tweets are likely to be promoted URLs. Likewise, a burst that he is in is likely to be a planned burst. Such relationships can be modeled in the T-MRF.

Figure 23. A simple example of a User-URL-Burst network

Motivated by the above intuition, we now present the proposed T-MRF model. T-MRF basically defines a graph with typed nodes, where each node type represents a type of entity of interest, e.g., user, URL, or burst in our case. Table IX summarizes the definitions of the symbols that we will use.

TABLE IX. Important Notations

Symbol                           Definition
V                                set of nodes in the graph
E                                set of edges in the graph
T                                mapping from nodes to node types
H                                set of types of nodes
v_i                              the i-th node or random variable in the graph
t_i                              type of node i, t_i ∈ H
S_{t_i}                          set of states node i can be in
ψ_i(σ_i | t_i)                   prior of node i in state σ_i
ψ_{i,j}(σ_i, σ_j | t_i, t_j)     edge potential for node i of type t_i in state σ_i and node j of type t_j in state σ_j
m_{i→j}(σ_j | t_j)               message from node i to node j expressing node i's belief of node j being in state σ_j
b_i(σ_i | t_i)                   belief of node i in state σ_i

Our typed graph is represented by G = (V, T, E), where V = {v_1, v_2, ..., v_n} is a set of nodes representing a set of random variables and E is a set of edges on V. T = {t_1, t_2, ..., t_n} is the set of corresponding node types of the nodes in V. Each t_i is an element of a finite set of types H, i.e., t_i ∈ H. For example, we use three node types in this work, i.e., H = {User, URL, Burst}. The edges between nodes represent their dependency relationships. Figure 23 schematically shows the three types of nodes and some edges between them. As we will see later, we can also add edges between dependent users. Each node v_i represents a random variable in the T-MRF and is associated with a set of states denoted by S_{t_i} with respect to its node type t_i. For instance, in our case, if t_i = User, then S_{t_i} = {promoter, non-promoter}. The state σ_i ∈ S_{t_i} that each node is in depends on its

observed features as well as its neighboring nodes in the network. In order to capture these dependencies, we define two kinds of potential functions, the node potential ψ_i(σ_i | t_i) and the edge potential ψ_{i,j}(σ_i, σ_j | t_i, t_j). ψ_i(σ_i | t_i) is the prior belief that the node v_i of type t_i is in state σ_i, measured from its own behavior and content features. The edge potential for a pair of nodes, also called the edge compatibility function, gives the probability of a node v_j of type t_j being in state σ_j given that its neighboring node v_i of type t_i is in state σ_i. For each pair of node types, the edge potentials between the two types of nodes are represented as a propagation matrix, which is used in the loopy belief propagation algorithm (LBP). The message passing equation of LBP now becomes:

m_{i→j}(σ_j | t_j) = z_1 Σ_{σ_i ∈ S_{t_i}} ψ_{i,j}(σ_i, σ_j | t_i, t_j) ψ_i(σ_i | t_i) Π_{k ∈ N(i)\j} m_{k→i}(σ_i | t_i)    (5.3)

The final belief b_i(σ_i | t_i) of a node i of type t_i is a vector of the same dimension as the message, measuring the probability of node i of type t_i being in state σ_i:

b_i(σ_i | t_i) = z_2 ψ_i(σ_i | t_i) Π_{k ∈ N(i)} m_{k→i}(σ_i | t_i)    (5.4)
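Relative to the untyped update sketched in Section 5.2.2, only the potential lookup changes; a fragment with illustrative names:

import numpy as np

def typed_message(i, j, node_type, priors, psi_by_type, msgs, nbrs):
    # Equation 5.3: the same sum-product update as before, except the propagation
    # matrix is looked up by the (t_i, t_j) type pair, e.g. ("User", "URL"), and
    # each node's state dimension is determined by its own type.
    inc = [msgs[(k, i)] for k in nbrs[i] if k != j]
    incoming = np.prod(inc, axis=0) if inc else np.ones(len(priors[i]))
    table = psi_by_type[(node_type[i], node_type[j])]
    m = table.T @ (priors[i] * incoming)
    return m / m.sum()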

In summary, adding node types in T-MRF allows each type of node to have a different set of states, and enables the user to specify the potentials based on the types of the two nodes in a node pair. T-MRF thus generalizes MRF, because when there is only one type of node, T-MRF reduces to MRF.

5.3 T-MRF For Promoter Detection

We now detail how to apply the T-MRF model to our application. Below, we first give the types of nodes, then the edge potentials, and finally the node potentials.

5.3.1 Node Types

Users: These are all the user account ids in a dataset.

URLs: These are the set of all URLs mentioned in the dataset. Most URLs on Twitter are shortened URLs; we use their expanded URLs instead, because multiple different shortened URLs may map to the same expanded URL.

Bursts: In our setting, a burst is a particular day on which the volume of tweets increases drastically. To detect bursts, we first generate a time series of the number of tweets per day and apply the peak detection algorithm in (Du et al., 2006) to find bursts.

5.3.2 Edge Potentials

Since we have three types of nodes, we can have 6 kinds of edges: user-URL, user-burst, URL-burst, user-user, burst-burst, and URL-URL. However, we only find the following four kinds of edges useful: user-URL, user-burst, URL-burst, and user-user. We now define the edge potentials for these four types of edges. The parametric algebraic formulations for the edge potentials (Table X) were derived from our pilot experiments based on the relations explained below. In the experiments, we report results for different values of ε to measure its sensitivity.

TABLE X. Propagation matrices ψ_{i,j}(σ_i, σ_j | t_i, t_j) for each type of edge potential

(a) t_i = User, t_j = URL
                  promoted    organic
promoter          1 - 2ε      2ε
non-promoter      2ε          1 - 2ε

(b) t_i = User, t_j = Burst
                  planned     normal
promoter          0.5 + ε     0.5 - ε
non-promoter      0.5 - ε     0.5 + ε

(c) t_i = URL, t_j = Burst
                  planned     normal
promoted          0.5 + ε     0.5 - ε
organic           0.5 - ε     0.5 + ε

(d) t_i = User, t_j = User
                  promoter    non-promoter
promoter          0.5 + ε     0.5 - ε
non-promoter      0.5 - ε     0.5 + ε

User-URL Potentials: A user and a URL form an edge if the user has tweeted the URL at least once. This kind of edge is useful because campaign promoters rely on the URLs they tweet to lead other Twitter users to the target websites. If a URL is heavily promoted, the users who tweet the URL are likely to be promoters. Conversely, URLs that are relatively less promoted are usually mentioned by non-promoters. URLs in the tweets of promoters are called promoted URLs. Non-promoters, who learn about the campaign through external sources such as news, TV and other websites, are less likely to collaborate with promoters on targeted URLs, but non-promoters can still have promoted URLs in their tweets due to the influence of the social media campaign. Furthermore, campaign promoters are more interested in their target URLs than in URLs from other websites. The edge potentials for this kind of edge are given in Table X(a), expressed as a propagation matrix to be used by LBP. The values in the matrix are set empirically.
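Since the four matrices share the ε parameterization, they can be generated directly; a sketch whose 0.5 ± ε entries follow Table X as reconstructed above (rows index the state of node i):

import numpy as np

def propagation_matrices(eps=0.1):
    # Table X as numpy arrays, keyed by the (t_i, t_j) type pair.
    user_url = np.array([[1 - 2 * eps, 2 * eps],       # promoter     -> promoted/organic
                         [2 * eps, 1 - 2 * eps]])      # non-promoter
    user_burst = np.array([[0.5 + eps, 0.5 - eps],     # promoter     -> planned/normal
                           [0.5 - eps, 0.5 + eps]])    # non-promoter
    url_burst = np.array([[0.5 + eps, 0.5 - eps],      # promoted     -> planned/normal
                          [0.5 - eps, 0.5 + eps]])     # organic
    user_user = np.array([[0.5 + eps, 0.5 - eps],      # promoter     -> promoter/non-promoter
                          [0.5 - eps, 0.5 + eps]])     # non-promoter
    return {("User", "URL"): user_url, ("User", "Burst"): user_burst,
            ("URL", "Burst"): url_burst, ("User", "User"): user_user}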

User-Burst Potentials: A user and a burst form an edge if the user posted at least one tweet in the burst. The arrival of a large number of tweets forming a burst is either a natural reaction to a successful campaign or a deliberate promoting activity by real promoters and/or their Twitter bots. We assume planned bursts contain primarily promoters, while normal bursts are mostly formed by normal users attracted by the campaign. Thus the user-burst relation can help identify groups of promoters. The edge potentials for this kind of edge are given in Table X(b), also expressed as a propagation matrix.

URL-Burst Potentials: A URL and a burst form an edge if the URL has been tweeted at least once in the burst. To maximize the influence of a campaign, campaign promoters have to continuously post tweets to maintain the advertising exposure of the URLs of interest. Similar to the User-Burst potentials, URLs mentioned within a planned burst are likely to be promoted, while URLs in a normal burst are likely to be organic. The edge potentials for this kind of edge are given in Table X(c), again expressed as a propagation matrix.

User-User Potentials: Several user accounts could potentially be owned by the same individual or institution (e.g., sockpuppets). Rather than working alone, campaign promoters can be well organized (note that sending tweets aggressively from individual accounts would result in account suspension according to Twitter's posting policy). A group of campaign accounts that work collaboratively can attract a larger audience and increase their

credibility. Without considering a group of accounts collectively, it is difficult to detect some individual promoters because of insufficient features. First of all, campaign promoters are inclined to send predefined tweets that are similar in content. Two users are similar if their tweet Content Similarity (CS) is high. Under the bag-of-words assumption, we treat each tweet as a vector and each user as the average vector of all his/her tweets. Note that since retweets are merely duplicates of original tweets, we discard them when measuring content similarity. We then use cosine similarity to measure the similarity of the tweets of two users:

CS_{i,j} = cosine(avg(tweets_i), avg(tweets_j))    (5.5)

Secondly, promoters are only concerned with their own products or events, so they tweet only a small set of URLs for their own benefit. Let r_i and r_j be the sets of URLs mentioned in the tweets of user i and user j respectively. The URL Similarity (US) of two users is measured by Equation 5.6 in terms of the Jaccard coefficient:

US_{i,j} = |r_i ∩ r_j| / |r_i ∪ r_j|    (5.6)

(Ghosh et al., 2012) showed that, to gain a larger audience, increase the perceived influence of their accounts and impact the rankings of their tweets, promoters may acquire followers either by establishing mutual following links between themselves or by targeting (following) other normal users, who then reciprocate out of social etiquette (Gayo Avello and

So another important measure of user similarity is the Following Similarity (FS). Let f_i and f_j be the sets of users followed by users i and j respectively. Equation 5.7 gives the Following Similarity of the two users.

FS_{i,j} = |f_i ∩ f_j| / |f_i ∪ f_j|   (5.7)

Finally, we define the similarity of a pair of users as the average of the above similarity measures (Equation 5.5, Equation 5.6 and Equation 5.7). The three similarity measures model the dependency between users whose connections do not exist in the original graph: if the similarity of two users is higher than some threshold, we add a user-user edge between them. Intuitively, if a user is connected with a promoter, then he/she is also likely to be a promoter. Therefore, the corresponding user-user propagation matrix is defined in Table X(d). The three measures are sketched below.
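A minimal sketch of the three similarity measures (Equations 5.5–5.7); the input containers (`tweet_vecs`, `urls`, `followees`) are hypothetical names for the per-user data described above:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity of two dense vectors; 0 if either is all-zero.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def jaccard(s, t):
    # Jaccard coefficient of two sets (the form of Eqs. 5.6 and 5.7).
    return len(s & t) / len(s | t) if (s | t) else 0.0

def user_similarity(u, v, tweet_vecs, urls, followees):
    # CS (Eq. 5.5): cosine of the users' averaged bag-of-words tweet
    # vectors, with retweets assumed to have been discarded already.
    cs = cosine(np.mean(tweet_vecs[u], axis=0),
                np.mean(tweet_vecs[v], axis=0))
    us = jaccard(urls[u], urls[v])             # URL similarity (Eq. 5.6)
    fs = jaccard(followees[u], followees[v])   # Following similarity (Eq. 5.7)
    return (cs + us + fs) / 3.0                # averaged, as in the text

# A user-user edge is added when this average similarity exceeds a threshold.
```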

Node Potentials: Prior Belief Estimation

Node potentials, or prior beliefs, of the different nodes in the network are important because, as our experimental results will show, they help the propagation algorithm converge much faster. This section details our approach to estimating the prior beliefs of the states that users, URLs and bursts are in. The estimated probabilities guide our proposed model toward more accurate posterior probabilities for all nodes in the network.

User Prior: We use supervised classification to compute the state priors for each user node. Since promoters and non-promoters have different goals, they differ greatly in how they behave. Similar to (Benevenuto et al., 2010), we define a set of content and behavior features for each user, and we learn a local classifier from a set of labeled users to estimate the state probability distributions of the remaining unlabeled users.

The content features include the number of URLs per tweet, the number of hashtags per tweet, the number of user mentions per tweet, and the percentage of retweets for each user. These features are important attributes that distinguish promoters from non-promoters. As the goal of promoters is to promote, they tend to provide as many URLs as possible in a tweet pertaining to their target events or products, so the number of URLs per tweet can discriminate promoters from normal users. Hashtags are another important indicator as they are often used to ride Twitter trends. There also exist promoters who send unwanted messages to target users by mentioning their usernames in tweets, so the abuse of user mentions is another important feature for the learner. As opposed to promoters, normal users (non-promoters) who show interest in the campaign are willing to retweet, reply, or give their own opinions.

Another important set of features is the behavior features, which capture the posting patterns of the two classes of users. The behavior features are: the maximum, minimum and average number of tweets per day; the maximum, minimum and average time interval between two consecutive tweets; the total number of tweets; and the number of unique URLs tweeted.
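The following is an illustrative sketch of assembling this feature vector; the tweet record fields (`urls`, `hashtags`, `mentions`, `is_retweet`, `timestamp`) are assumed names, and "tweets per day" is computed over the user's active days:

```python
import numpy as np
from collections import Counter
from datetime import datetime, timezone

def user_features(user_tweets):
    """Content + behavior features for one user (tweets sorted by time)."""
    n = len(user_tweets)
    # Content features: URLs / hashtags / mentions per tweet, retweet ratio.
    content = [
        sum(len(t["urls"]) for t in user_tweets) / n,
        sum(len(t["hashtags"]) for t in user_tweets) / n,
        sum(len(t["mentions"]) for t in user_tweets) / n,
        sum(t["is_retweet"] for t in user_tweets) / n,
    ]
    # Behavior features: per-day tweet counts and inter-tweet time gaps.
    days = Counter(datetime.fromtimestamp(t["timestamp"], timezone.utc).date()
                   for t in user_tweets)
    gaps = [b["timestamp"] - a["timestamp"]
            for a, b in zip(user_tweets, user_tweets[1:])] or [0.0]
    behavior = [
        max(days.values()), min(days.values()),
        n / len(days),                    # average tweets per active day
        max(gaps), min(gaps), float(np.mean(gaps)),
        n,                                # total number of tweets
        len({u for t in user_tweets for u in t["urls"]}),  # unique URLs
    ]
    return np.array(content + behavior, dtype=float)
```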

For classification, we use logistic regression because it gives the estimated posterior probability of each class, which is useful for LBP. We first train a logistic regression classifier on the small fraction of users that are labeled manually, and then run it on the remaining users to estimate their probabilities of being promoters and non-promoters. Let the promoter class be our positive class and the non-promoter class be our negative class. The class probabilities of a user are computed through Equation 5.8 and Equation 5.9, where k is the total number of features and x_j is the j-th feature.

P_user(+) = e^{β_0 + Σ_{j=1}^{k} β_j x_j} / (1 + e^{β_0 + Σ_{j=1}^{k} β_j x_j})   (5.8)

P_user(−) = 1 / (1 + e^{β_0 + Σ_{j=1}^{k} β_j x_j})   (5.9)
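In practice this step is a standard logistic regression fit; a hedged sketch with scikit-learn, using synthetic stand-in data rather than the thesis datasets, where `predict_proba` yields exactly the quantities of Equations 5.8 and 5.9:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 12))     # stand-in features, labeled users
y_train = rng.integers(0, 2, size=400)   # 1 = promoter, 0 = non-promoter
X_unlabeled = rng.normal(size=(1000, 12))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Column 1 is P_user(+) (Eq. 5.8) and column 0 is P_user(-) (Eq. 5.9);
# these serve as the node potentials of the unlabeled user nodes.
user_priors = clf.predict_proba(X_unlabeled)
```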

URL and Burst Prior: Using the same strategy, a URL could be classified into the promoted or organic class. However, labeling URLs is difficult because there are usually a large number of tweets containing a URL, which increases the cost of labeling tremendously. Moreover, the tweets associated with a URL can come from both promoters and non-promoters, which further increases the labeling difficulty. On the other hand, we can get reasonable estimates of the class/state probabilities of URL nodes from the labels of users.

P_url(+) = (n_+ + α) / (n_+ + n_− + 2α)   (5.10)

P_url(−) = (n_− + α) / (n_+ + n_− + 2α)   (5.11)

If a URL is tweeted more by promoters than by non-promoters, it is believed to be promoted. We define promoted URLs as the positive class and organic URLs as the negative class. The prior probability of a URL is calculated from Equation 5.10 and Equation 5.11, where n_+ is the number of times the URL is mentioned by all the labeled promoters and n_− is the number of times it is mentioned by all the labeled non-promoters. URLs tweeted by neither labeled promoters nor labeled non-promoters have equal probabilities of being in the two states. Even though there are many more unique URLs than labeled users, the priors of the popular URLs in the campaign can be estimated approximately. We use Laplace smoothing to obtain smoothed estimates; in our experiments, we use α = 1. Similarly, we estimate the prior belief of a burst being in the planned or normal state using the same strategy: planned bursts are dominated by promoters while normal bursts are dominated by normal users.
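A minimal sketch of the Laplace-smoothed prior of Equations 5.10 and 5.11; the same helper applies unchanged to burst priors:

```python
# pos_mentions / neg_mentions: how often the URL appears in tweets of
# labeled promoters and labeled non-promoters, respectively.
def url_prior(pos_mentions: int, neg_mentions: int, alpha: float = 1.0):
    denom = pos_mentions + neg_mentions + 2 * alpha
    p_promoted = (pos_mentions + alpha) / denom   # Eq. 5.10
    return p_promoted, 1.0 - p_promoted           # Eq. 5.11

print(url_prior(0, 0))   # unseen URL -> (0.5, 0.5)
print(url_prior(9, 1))   # mostly tweeted by promoters -> (~0.83, ~0.17)
```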

Overall Algorithm

Finally, we put everything together and present the overall algorithm of the proposed detection technique, which is given in Algorithm 4.

Algorithm 4 The overall algorithm
Input: A set of labeled users U_train for training;
       a set of tweets D on a particular topic;
       the propagation matrices ψ_{i,j}(σ_i, σ_j | t_i, t_j)
Output: Probability estimate of every user being a promoter
 1: Train a classifier c from D and U_train
 2: Apply c to all the unlabeled users to obtain the user priors (node potentials) ψ_i(σ_i | t_i = user)
 3: Calculate the URL and burst priors ψ_i(σ_i | t_i = URL) and ψ_i(σ_i | t_i = burst) using Equation 5.10 and Equation 5.11
 4: Build the User-URL-Burst graph G(V, T, E) from D
 5: for (v_i, v_j) ∈ E do
 6:     for all states σ_j of v_j do
 7:         m_{i→j}(σ_j | t_j) ← 1
 8:     end for
 9: end for
10: while not converged do
11:     for (v_i, v_j) ∈ E do
12:         for all states σ_j of v_j do
13:             update m_{i→j}(σ_j | t_j) in parallel using the LBP message-update equation
14:         end for
15:     end for
16: end while
17: Calculate the final belief b_i(σ_i | t_i) of every node in all its states using the LBP belief equation
18: Output the probability of every user being a promoter, b_i(σ_i = promoter | t_i = user)

Line 1 trains a local classifier c using the available labeled training data. c is then applied to all unlabeled user nodes and assigns each of them a probability of being a promoter (line 2), which also serves as the node potentials of the user nodes.

Line 3 computes the node potentials of each URL node and each burst node using Equations 5.10 and 5.11. Note that the edge potentials are reflected in the propagation matrices of Table X. Line 4 builds the graph G. Lines 5 through 16 correspond to the message passing algorithm of LBP: we first initialize all messages to 1 and then repeatedly update the messages of each node with the messages from its neighboring nodes. The normalized belief of each user node in the promoter state (line 18) is the final output.
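For illustration, the message-passing loop (lines 5–16) and the final belief computation (lines 17–18) can be sketched generically as below. This is a textbook LBP implementation under the stated propagation matrices, not the thesis code; it runs a fixed number of iterations in place of a convergence test, and it assumes every node has at least one neighbor:

```python
import numpy as np

def loopy_bp(nbrs, edges, prior, iters=50):
    """nbrs:  node -> list of neighbor nodes (both directions present).
    edges: directed pair (i, j) -> propagation matrix for that edge type;
           edges[(j, i)] should be edges[(i, j)].T.
    prior: node -> node-potential vector over its states."""
    # Initialize all messages m_{i->j} to 1 (lines 5-9 of Algorithm 4).
    msgs = {(i, j): np.ones_like(prior[j], dtype=float)
            for i in nbrs for j in nbrs[i]}
    for _ in range(iters):              # "while not converged"
        new = {}
        for (i, j) in msgs:
            # Product of i's prior and messages from neighbors except j.
            incoming = [msgs[(k, i)] for k in nbrs[i] if k != j]
            h = prior[i] * (np.prod(incoming, axis=0) if incoming else 1.0)
            m = edges[(i, j)].T @ h     # pass through the propagation matrix
            new[(i, j)] = m / m.sum()   # normalize to avoid underflow
        msgs = new
    # Final beliefs (lines 17-18).
    beliefs = {i: prior[i] * np.prod([msgs[(k, i)] for k in nbrs[i]], axis=0)
               for i in nbrs}
    return {i: b / b.sum() for i, b in beliefs.items()}

# Toy usage: one user node connected to one URL node.
eps = 0.1
psi = np.array([[1 - 2 * eps, 2 * eps], [2 * eps, 1 - 2 * eps]])
nbrs = {"u": ["r"], "r": ["u"]}
edges = {("u", "r"): psi, ("r", "u"): psi.T}
prior = {"u": np.array([0.8, 0.2]), "r": np.array([0.5, 0.5])}
print(loopy_bp(nbrs, edges, prior))
```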

5.4 Experiments

We now evaluate the proposed promoter detection algorithm based on T-MRF. We also compare it with the algorithm in (Benevenuto et al., 2009), which is in fact our local classifier, and with several other baselines. Note that (Benevenuto et al., 2009) works in the YouTube context; we adapted it to our Twitter context, the main difference being that we use a different set of features in learning.

Datasets and Settings

We use three Twitter datasets related to health science to evaluate our model. The first two datasets are about two known anti-smoking campaigns launched by the Centers for Disease Control and Prevention (CDC) from March to June 2012 and from March to June 2013 respectively. The goal of the two anti-smoking campaigns was to raise awareness of the harm of tobacco smoking by telling the public the real-life stories of several former smokers. During the campaigns, a large number of tweets were posted by CDC staff and by individuals in their affiliated organizations, who are promoters. Due to the campaigns, a large number of individuals from the general public also tweeted about the events and the involved web pages and news articles. The third dataset consists of tweets about electronic cigarettes (e-cigarettes) posted by Twitter users from May to June 2012. We do not know of any campaign in this dataset, yet our algorithm finds a large number of promotions by different e-cigarette brands.

For each dataset, we set filters to fetch all the relevant tweets from Gnip, the largest social media data provider (which owns the full archive of the public-facing Twitter data). Gnip allows us to retrieve tweets using a list of filtering rules including keyword matching, phrase matching and logic operations. The datasets were all retrieved and verified by a group of health scientists (the last two authors of the paper and their research team).

Our proposed approach relies on user behavior features to obtain reasonable prior estimates, so we exclude users who tweeted only once, because little evidence or feature information can be observed from them; incorporating single-tweet users is left as future work. Note also that the URLs in users' tweets are mostly shortened URLs due to the 140-character limit per tweet. We used a Gnip tool to expand the shortened URLs to the actual URLs of the web pages, and in our experiments we use the expanded URLs to represent the URL entities (nodes) in T-MRF. Table XI gives the statistics of the three datasets after single-tweet users are removed. The topic of CDC2013 is the same as that of CDC2012, but with more promotion effort and more participants.
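The single-tweet-user exclusion described above amounts to a trivial filtering pass; a sketch under the assumption that each tweet record carries a `user` field:

```python
from collections import Counter

def filter_single_tweet_users(tweets):
    # Drop users with only one tweet: too few behavioral features
    # can be observed for them.
    counts = Counter(t["user"] for t in tweets)
    return [t for t in tweets if counts[t["user"]] > 1]
```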

TABLE XI. Data statistics: the numbers of users, tweets, URLs, and labeled promoters/non-promoters for CDC2012, CDC2013 and E-cigarettes (after single-tweet users are removed)

For each dataset, we manually labeled 800 users, with the help of the health science researchers. For each user, the labeling decision was made based on the features defined earlier, the list of URLs he/she tweeted, and the intents expressed in his/her tweets. For each experiment, we perform 5 random runs; in each run, we randomly select 400 users for training and the other 400 users for testing. Each result reported below is the average over the 5 runs. We use logistic regression to build each local classifier, which also provides the prior beliefs of the user nodes, and we then employ Loopy Belief Propagation to infer the posterior of each unlabeled node in the network.

Results

Since our promoter detection model yields, for each user, a probability of being a promoter, we use the popular Receiver Operating Characteristic (ROC) curve to evaluate its performance, and we report the Area Under the Curve (AUC).
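A hedged sketch of this evaluation protocol with scikit-learn, using synthetic stand-in data; in the thesis, the scores come from the full T-MRF pipeline rather than the plain classifier used here for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 12))    # stand-in features for 800 labeled users
y = rng.integers(0, 2, size=800)  # 1 = promoter, 0 = non-promoter

aucs = []
for seed in range(5):             # 5 random 400/400 train-test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=400,
                                              random_state=seed)
    probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr) \
                                             .predict_proba(X_te)[:, 1]
    aucs.append(roc_auc_score(y_te, probs))
print(np.mean(aucs))              # mean AUC over the 5 runs
```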

TABLE XII. AUC (Area Under the Curve) for each dataset (CDC2012, CDC2013, E-cigarettes), each system (Local-LR, ICA, T-MRF(all-nodes, no-priors), T-MRF(user-url), T-MRF(all-nodes, no-user-user), T-MRF(all)) and different ε values

We compare the following systems, which progressively include more information. The AUC values are given in Table XII for different values of ε. Based on the results, we have the following observations:

Local-LR: This is the traditional classification approach, which does not use any relationships between nodes. This method is similar to (Benevenuto et al., 2009). We use logistic regression (LR) as the learning algorithm because it gives class probabilities, which we also use as priors in LBP. It performs worse than all the other systems except T-MRF with no priors, which means that local classification is not sufficient and that relational information is very useful for classification on all three datasets.

ICA: This is the classic collective classification algorithm (Sen et al., 2008), which utilizes all the relationships between nodes. We use logistic regression (LR) as the base learning algorithm and

compare it with our proposed T-MRF. As the labels of URLs and bursts are based only on a rough estimation, it does not perform as well as our proposed final T-MRF.

T-MRF (all-nodes, no-priors): This baseline uses all three types of nodes, but it does not use any node potentials; it thus relies purely on the network effect. Without priors, every state of a node is initialized with the same probability, i.e., a uniform distribution over the states of the node's type. It performs the worst of all the systems, which is understandable: without any reasonable priors, the system has little guidance and can hardly achieve good results.

T-MRF (user-url): This baseline uses only two types of nodes, user and URL; burst nodes are not used. The method discussed in the Node Potentials section is employed to assign prior probabilities to the states of each node, and the user-URL edge potentials of Table X(a) are used. It does better than Local-LR and T-MRF (all-nodes, no-priors): although this baseline does not use burst nodes, the edge potentials between users and URLs enable the system to do quite well.

T-MRF (all-nodes, no-user-user): This model uses all three types of nodes, with the priors computed by the methods in the Node Potentials section. It also uses the edge potentials for user-URL, user-burst and URL-burst, but not user-user; we want to single out the effect of the user-user potentials, which are included only in the final system below. This model improves further because burst nodes are now used and the edge potentials of user-burst and URL-burst are applied, even though the user-user edge potentials are not.

T-MRF (all): This is the proposed full system, which uses all three types of nodes, all priors and all edge potentials. It outperforms all the baselines. Compared to T-MRF (all-nodes, no-user-user), we see that the user-user similarity-based potentials are very helpful. From Table XII, we can also see that T-MRF (all) makes marked improvements over Local-LR and T-MRF (all-nodes, no-priors).

In summary, we conclude that the proposed T-MRF method is highly effective. It clearly improves over both the traditional classifier LR and the relational classifier ICA across all settings of ε, which shows that the proposed T-MRF model captures the dynamics of the problem better than the baseline approaches and is not very sensitive to the choice of ε. Furthermore, the performance improvements of T-MRF are statistically significant (p < 0.002) according to a paired t-test.

CHAPTER 6

CONCLUSIONS

In this thesis, we have studied the problem of detecting opinion spammers and campaign promoters in review websites and social media platforms. In this direction, we thoroughly explored four important tasks of opinion spam detection: leveraging the temporal and spatial patterns of spammers, detecting spammers via positive and unlabeled learning, detecting spammer groups through their collective co-bursting activities, and detecting campaign promoters. Our proposed algorithms and models are evaluated by experiments on real-life datasets shared by Dianping. The contributions of our work are summarized as follows:

First, we studied the temporal and spatial dynamics of opinion spammers. The large amount of data with rich entity attributes allowed us to deeply investigate differences between spammers and non-spammers that other research work has not exploited. We proposed several novel features and metrics to characterize opinion spammers along temporal, spatial, and many other behavioral dimensions. Throughout the analyses, we pinpointed the important attributes of spammers and gave suggestions on how to detect and prevent opinion spamming. Experimental results indicate that the proposed rich features are strong indicators of spammers. We believe that our findings can facilitate studies of spam detection in the broader research community and help review hosting companies build more robust fake review detection systems.

We then studied the problem of fake review detection in a collective PU learning framework. With the labeled data provided by the review hosting website Dianping, we conducted several experiments showing that, by combining collective classification and PU learning, our proposed CPU model has some major advantages: it not only outperforms the strong baselines, but it can also identify many hidden fake reviews among the unlabeled instances, which demonstrates the effectiveness of PU learning.

Lastly, we studied the problem of identifying hidden campaign promoters on Twitter who promote target products, services, ideas, or messages. To the best of our knowledge, this problem had not been studied before in the Twitter context. We proposed a novel method to deal with the problem based on Markov Random Fields (MRF). Since the traditional MRF does not consider different types of nodes and their diverse interactions, we generalized MRF to T-MRF to flexibly deal with any number of node types and complex dependencies. Our experiments using three health science Twitter datasets show that the proposed method is highly accurate.

APPENDICES

Appendix A: Copyrights
