A Framework for Fake Review Annotation


2015 17th UKSIM-AMSS International Conference on Modelling and Simulation A Framework for Fake Review Annotation Somayeh Shojaee, Azreen Azman, Masrah Murad, Nurfadhlina Sharef and Nasir Sulaiman

Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia
somayeh.shojaee@gmail.com, {azreenazman, masrah, nurfadhlina, nasir}@upm.edu.my

Abstract

The effectiveness of opinion mining relies on the availability of credible opinions for sentiment analysis. Deceptive opinions from spammers often need to be filtered out, so several studies have addressed the detection of spam reviews. Testing the validity of spam detection techniques is also problematic because annotated datasets are scarce. Existing studies take one of two approaches to overcome this problem: hiring annotators to label reviews manually, or using crowdsourcing websites such as Amazon Mechanical Turk to create an artificial dataset. Data collected with the latter method may not generalize to real-world problems, while the former method, detecting fake reviews manually, is difficult and carries a high chance of misclassification. In this paper, we propose a novel technique for annotating review datasets for spam detection that provides annotators with more information and meta data about both reviews and reviewers. We propose a framework and develop an on-line annotation system to improve the review annotation process. The system was tested on several reviews from amazon.com, and the results are promising, with a labeling error of 0.10.

Keywords: opinion mining; review spam; fake review; spammer; annotation; Amazon Mechanical Turk

I. INTRODUCTION

In the era of Web 2.0, social media, review websites and opinion-sharing websites are part of people's everyday lives.
These kinds of websites allow people to express their personal experiences, interests and feelings not only about products and services but also about social, political and economic issues in the community ([1]). Companies, governments and other parties clearly benefit from understanding what the public thinks about their products and services. User opinion can affect product sales, changes in government policy, and so on. However, the widespread sharing and use of user-contributed reviews has also raised concerns about their reliability because of the large number of untruthful reviews. Reviews produced by people who have no personal experience with the subject of the review are called spam, fake, deceptive or shill reviews.

Researchers have developed various spam detection techniques in the last few years to improve the accuracy of opinion-mining results. The major task in these techniques is distinguishing spam reviews from truthful reviews. As the number of reviews increases, different kinds of methods have been established to improve opinion-mining tasks, e.g. [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] and [17]. Most existing studies tried to spot fake reviews that a human could detect, but current fake reviews are written more skillfully, and detecting them is a concern for companies, governments and researchers.

On the other hand, evaluating the methods that researchers propose requires a labeled dataset, but manually annotating fake reviews is time consuming and confusing for annotators. Therefore, one of the important challenges of spam and spammer detection is the lack of labeled datasets. To our knowledge, there is only one publicly available dataset, [4], [5], [6], with a true gold standard for the product review domain ([11]).
Most existing work on review spam detection has either manually labeled a portion of a real review corpus or used crowdsourcing websites to create an artificial corpus. In the first method, the researchers hire two or more students to manually label each review for spam ([7], [9], [10], [11], [13], [14], [15], [17], [18], [19]) and then calculate inter-evaluator agreement, for example with Cohen's (or Fleiss') kappa, a measure of the degree of consistency between two (or more than two) raters. Based on this measure, they decide whether to accept or reject the annotation. However, fake reviews are not easily identified by a human reader. In the second method, researchers gather fake and/or real reviews by paying people on crowdsourcing websites to write artificial reviews ([4], [5], [6], [3]). The problem with this kind of corpus is that crowdsourced fake reviews may not be representative of real-life fake reviews ([8]). Amazon Mechanical Turk is normally used to perform simple tasks that require human judgment. The authors of [8] argued that the Turkers did not do a good job of faking, probably because they did not have enough knowledge of the domain or did not put much effort into writing fake reviews, as they had little to gain from doing so. This may explain why the Amazon Mechanical Turk data can attain high fake review detection accuracy.

In this paper, we investigate the problem of annotating review datasets and propose a novel technique for spam review annotation that provides annotators with more useful information and meta data about the reviews as well as the reviewers. We collect several hints for spam from existing work and propose some rules to increase the reliability of the labeling task. Our system takes the role of the reviewer into account through rules related to reviewer behavior and review meta data. The paper is organized as follows: Section II summarizes related work on review spam annotation; Section III introduces our annotation framework and the on-line annotation system and evaluates the reliability of our system; Section IV concludes our investigation.

II. RELATED WORK

To test the accuracy of review/opinion detection tasks, researchers have developed two different techniques. The first is hiring experts to label the corpus manually. Human evaluation of a dataset is not new; it has been widely used in the evaluation of information retrieval tasks. It is very difficult, however, and there is a high chance of poor annotation, because spam reviews must be detected by just reading a review without extra knowledge such as information about the review and its reviewer. Prior literature has shown that humans are not good at detecting deception ([20]), including fake reviews ([4]). The second method is using crowdsourcing websites to create artificial reviews (spam and/or non-spam). The main challenge of this method is that unreal reviews can affect the accuracy of the evaluation of spam detection techniques, and artificial reviews are applicable only to specific kinds of spam detection techniques, in which other information about the review such as meta data (posting date, feedback, rating, ...)
or about the reviewer, such as the number of reviews posted by the reviewer, is not considered. Yet this kind of information is very useful during the spam detection process. The following summarizes existing studies that applied the two annotation methods explained above, as well as the most common datasets used for review spam detection.

In [19], [18], [17], the researchers crawled amazon.com and collected 5,838,032 reviews. They treated duplicates and near-duplicates as spam reviews and hired experts to label the reviews. The authors of [15], [14] also crawled amazon.com. They used Amazon Web Services to extract reviews from ten product categories during January A subset of this collection (2,100 spam and 207,900 ham) was then used to build their evaluation dataset. For each product category, two human annotators were appointed to review the candidate spam set. If both annotators confirmed a spam case, the pair of reviews was added to the confirmed spam set.

In [21], product reviews were crawled from epinions.com. The dataset consists of about 60,000 reviews. Ten college students were employed to annotate the review spam dataset. They were first asked to read about spam detection clues and discussions to learn what review spam looks like. They then independently labeled the review data. Each review was labeled by two people, and conflicts were resolved by a third.

The authors of [13] also crawled reviews of manufactured products from amazon.com, comprising 53,469 reviewers, 109,518 reviews and 39,392 products. They employed 8 expert judges, employees of Rediff Shopping and ebay.in, to label their candidate groups.

The data in [10] were crawled from resellerratings.com on Oct. 6th, They cleaned the data by removing users and stores with no reviews. After that, they have reviewers who wrote 408,470 reviews on 14,561 stores in total.
Human evaluation was necessary to judge reviewers. The human evaluators were 3 computer science graduate students who also had extensive on-line shopping experience. They worked independently on spammer identification.

The authors of [11] crawled hotel reviews from tripadvisor.com for nearly 4,000 hotels located in 21 big cities such as London, New York, and Chicago. The crawled data amounts to 839,442 reviews over the period of They mixed and matched the gold standard data ([4], [5], [6]) and their pseudo-gold standard data in three different combinations as follows:
(a) rule, gold: train on the dataset with the pseudo gold standard determined by one of the strategies, and test on the gold standard dataset.
(b) gold, rule: train on the gold standard dataset and test on the pseudo gold standard dataset.
(c) rule, rule: train and test on the pseudo gold standard dataset.

In another study, [9], the review data were collected from resellerratings.com on Oct 6th, It contains 408,469 reviews written by 343,629 reviewers for 25,034 stores. 90% of the reviewers wrote only one review, and about 76% of the reviews are single reviews. The authors focused on stores with a large number of single reviews, so for the evaluation they selected the top 53 stores, each of which has more than 1,000 reviews. They asked human evaluators to read the reviews from all 53 stores and decide how suspicious each store was. If two or more evaluators voted a store as likely to have committed a single review spam attack, it was tagged as a likely dishonest store. According to the human evaluation, a total of 29 stores had at least two votes.

In [7], the authors conducted experiments on their own dataset of amazon.com reviews of electronic products. The prepared dataset consists of 6,489 reviews written by 1,078 reviewers. They employed several college students to annotate the dataset. The students were first asked to read guideline websites and research papers.

After learning these spam signals and suggestions on how to spot fake reviews, they independently labeled the review dataset. Each review and reviewer was annotated by two different students independently. If a review of a reviewer received different labels, it was annotated by another two students.

Some studies chose the second technique for data collection, generating artificial reviews. In [4], [5], [6], truthful reviews were collected from TripAdvisor for the 20 most popular Chicago hotels, and deceptive reviews for the same hotels were gathered using Amazon Mechanical Turk (AMT). This corpus includes 400 truthful positive reviews from TripAdvisor, 400 deceptive positive reviews from Mechanical Turk, 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp, and 400 deceptive negative reviews from Amazon Mechanical Turk, yielding a final corpus of 1,600 reviews. In the other study, [3], shill reviews were gathered by employing students to write shill comments, and truthful reviews were collected from amazon.com reviews carrying the Amazon Verified Purchase sign.

Algorithm II.1: REVIEW ANNOTATION(Q, Q')
  comment: reviewer set R; review set R';
  comment: reviewer question set Q; review question set Q';
  while (R) do
    while (Q) do
      if Q_j == 1
        then k ← k + 1; j ← j + 1
        else j ← j + 1
    while (R') do
      while (Q') do
        if Q'_i == 1
          then k' ← k' + 1; i ← i + 1
          else i ← i + 1
      if k' > threshold_k'
        then y ← 1
        else if k > threshold_k
          then y ← 1
          else y ← 0

III. FRAMEWORK DEVELOPMENT AND EVALUATION

We propose a framework to improve the manual labeling process by simultaneously considering clues, in the form of questions, from both reviews and reviewers. In this framework, for each set of reviews, we present all reviews written by the same reviewer to the annotator at the same time. The 11 questions for spam detection and the 5 questions for spammer detection are extracted from existing work on the respective problems.
The questions are selected using a feature selection wrapper method because of its ability to take feature dependencies into account. TABLE I shows the questions, based on the selected subset of features, that are used as clues.

Algorithm II.1 presents the labeling process. For each reviewer in R, it counts the spammer detection clues that apply to the reviewer by adding one to k. Then, for every review of the reviewer, it counts the spam detection clues that apply to the review by adding one to k'. If k' is greater than threshold_k', which is set to 5 for our test, the review label, y, is set to 1, or spam. Otherwise, it checks threshold_k, which is set to 2 for our test; if k is higher than threshold_k, then y is also set to 1. If k and k' do not exceed threshold_k and threshold_k' respectively, the review label is set to 0. For instance, for a reviewer with more than 1,000 reviews, if four of the 5 spammer questions are answered positively, then k = 4. After that, k' is calculated for each of the reviewer's reviews. If k' for a review is greater than 5, then y = 1 and the review is annotated as spam.

The advantage of this system is that it provides hints and useful meta data about the review and the reviewer to annotators. Labeling a review as spam or non-spam by just looking at its content is a challenging task even for experts. Therefore, such a system can help to improve the annotation process. We implemented the framework using the hints for spam and spammer detection and the above algorithm. Figure 1 is a snapshot of the on-line annotation system that applies the proposed framework.

To evaluate the proposed framework, we conducted a test using 50 annotated reviews from amazon.com.
The genuine reviews were collected based on the Amazon Verified Purchase sign, an amazon.com feature confirming that the customer who wrote the review purchased the item at amazon.com. The fake reviews were selected from reviewers with more than 10,000 reviews without any Amazon Verified Purchase sign. A user with many reviews for different products is more likely to be a spammer [17]. Therefore, we selected fake reviews from reviewers who have a high chance of being spammers. We used the following equation to calculate the difference between the labels predicted using our on-line system and the real labels, that is, the fraction of reviews whose predicted label differs from the real label:

$$\mathrm{error} = \operatorname{mean}\big(\mathit{labeled\_pred}(:) \neq \mathit{labels}(:)\big) \tag{1}$$
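The labeling rule described above and the error measure of Equation (1) can be sketched in Python as follows. This is a minimal illustration under our reading of the text, not the authors' implementation: the function and constant names are hypothetical, answers are assumed to be encoded as 1 (yes) and 0 (no), the thresholds are the values 5 and 2 reported for the test, and Equation (1) is read as the fraction of reviews whose predicted label differs from the real label.

```python
from typing import Sequence

# Thresholds reported in the text for the test; the constant names are hypothetical.
REVIEW_THRESHOLD = 5    # compared with k', the count of positive review (spam) answers
REVIEWER_THRESHOLD = 2  # compared with k, the count of positive reviewer (spammer) answers


def count_positive(answers: Sequence[int]) -> int:
    """Count the questions answered positively (assumed encoding: 1 = yes, 0 = no)."""
    return sum(1 for answer in answers if answer == 1)


def label_review(reviewer_answers: Sequence[int],
                 review_answers: Sequence[int],
                 reviewer_threshold: int = REVIEWER_THRESHOLD,
                 review_threshold: int = REVIEW_THRESHOLD) -> int:
    """Return y = 1 (spam) if either count exceeds its threshold, else y = 0."""
    k = count_positive(reviewer_answers)      # spammer clues for the reviewer
    k_prime = count_positive(review_answers)  # spam clues for this review
    if k_prime > review_threshold:
        return 1
    if k > reviewer_threshold:
        return 1
    return 0


def labeling_error(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of reviews whose predicted label differs from the real label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one label is required")
    mismatches = sum(1 for p, a in zip(predicted, actual) if p != a)
    return mismatches / len(predicted)


if __name__ == "__main__":
    # One spammer clue for the reviewer (k = 1), six spam clues for the review (k' = 6).
    print(label_review([1, 0, 0, 0, 0], [1] * 6 + [0] * 5))
    # Illustrative labels only: 50 reviews, 5 of them mislabeled.
    actual = [1] * 25 + [0] * 25
    predicted = [0] * 5 + [1] * 20 + [0] * 25
    print(labeling_error(predicted, actual))
```

Under this reading, an error of 0.10 on 50 reviews corresponds to 5 reviews whose predicted label differs from the real label. Because the two threshold checks are combined as alternatives, a reviewer whose count k exceeds its threshold causes every one of that reviewer's reviews to be labeled as spam.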

TABLE I. SPAM AND SPAMMER DETECTION HINTS

Spam Detection Hints
Question | Reference
Is this review more general, using only the more obvious official product features rather than unofficial features? | [3]
Is this review unrelated to the product? | [22], [3]
Is it only full of meaningless adjectives and buzzwords? | [3]
Is this review difficult to understand? | [3]
Does this review repeat the product name over and over? | [18], [23], [21]
Does it contain a promotion code, URL, address or phone number? | [22]
Does this review contain more verbs and personal pronouns than nouns, adjectives and prepositions? | [3]
Does this review contain the brand-approved version of the product name? | [23]
Does this review read like marketing speak? | [4], [5], [6]
Does this review read like a competitors' war? | [4], [5], [6]
Does this review have long-winded explanations or short words? | [24], [25], [19]

Spammer Detection Hints
Question | Reference
Does this reviewer always give only good or only bad comments? | [26], [10]
Does this reviewer write reviews for different types of products or brands? | [18], [26], [27]
Does this reviewer write multiple reviews for a single product? | [18], [26], [27]
Does this reviewer write similar reviews for different products? | [18], [26], [27]
Does this reviewer have a high-frequency commenting time series? | [28], [27]

Figure 1. Annotation System

The result is promising: the error is 0.10, that is, on average only one of every 10 reviews is labeled incorrectly when the labeled sample is applied.

IV. CONCLUSION

In this paper, we proposed a framework for annotating review corpora for fake review detection. We selected the most effective features using a feature selection wrapper method to develop our two sets of clues, in the form of questions, for spam and spammer detection. An algorithm was developed to calculate the label for each review based on the hints provided for each reviewer and his or her reviews. The system implemented on the proposed framework was tested using a dataset of genuine and fake reviews, with an error of 0.10 in predicting review labels. The low level of misclassification suggests that labeling reviews with more knowledge about both the review and the reviewer is more precise than labeling individual fake reviews on their content alone. We believe that, unlike previous approaches, the framework and the system give a better context for judging and comparison. In future work, we will further improve the proposed framework by investigating more hints and evaluating our system on a bigger dataset. We will also give a weight to each feature to highlight the role of the most effective features in the labeling process and so decrease misclassification.

ACKNOWLEDGMENT

The authors would like to thank the Malaysia Ministry of Science, Technology and Innovation and Universiti Putra Malaysia for funding this research. This work is supported by the Malaysia Ministry of Science, Technology and Innovation, ScienceFund, project number SF1688 and the Universiti Putra Malaysia, Research University Grant Scheme (RUGS), project number RU.

REFERENCES

[1] E. T. Anderson and D. I. Simester, Reviews Without a Purchase: Low Ratings, Loyal Customers, and Deception, Journal of Marketing Research, vol. 51, no. 3, pp , Jun
[2] S. Shojaee, M. A. A. Murad, A. B. Azman, N. M. Sharef, and S. Nadali, Detecting Deceptive Reviews Using Lexical and Syntactic Features, in 13th International Conference on Intelligent Systems Design and Applications. IEEE, Dec. 2013, pp
[3] T. Ong, M. Mannino, and D. Gregg, Linguistic Characteristics of Shill Reviews, Electronic Commerce Research and Applications, vol. 13, no. 2, pp , Mar
[4] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock, Finding Deceptive Opinion Spam by Any Stretch of the Imagination, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp
[5] M. Ott, C. Cardie, and J. Hancock, Estimating the Prevalence of Deception in Online Review Communities, in Proceedings of the 21st International Conference on World Wide Web - WWW '12. New York, New York, USA: ACM Press, 2012, p
[6] M. Ott, C. Cardie, and J. T. Hancock, Negative Deceptive Opinion Spam, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia, USA: Association for Computational Linguistics, Jun
[7] Y. Lu, L. Zhang, Y. Xiao, and Y. Li, Simultaneously Detecting Fake Reviews and Review Spammers Using Factor Graph Model, in Proceedings of the 5th Annual ACM Web Science Conference on - WebSci '13, no. ii. New York, New York, USA: ACM Press, 2013, pp
[8] A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance, What Yelp Fake Review Filter Might Be Doing? Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), 2013, pp
[9] S. Xie, G. Wang, S. Lin, and P. S. Yu, Review Spam Detection via Temporal Pattern Discovery, Proceedings of the 21st International Conference Companion on World Wide Web. New York, NY, USA: ACM, 2012, pp
[10] G. Wang, S. Xie, B. Liu, and P. S. Yu, Identify Online Store Review Spammers via Social Review Graph, ACM Trans. Intell. Syst. Technol., vol. 3, no. 4, p. 61, Sep
[11] S. Feng, L. Xing, A. Gogar, and Y. Choi, Distributional Footprints of Deceptive Product Reviews, in Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), 2012, pp
[12] M. G. Frank, M. A. Menasco, and M. O. Sullivan, Human Behavior and Deception Detection, in Wiley Handbook of Science and Technology for Homeland Security. John Wiley & Sons, Inc., 2008, vol. 5, pp
[13] A. Mukherjee and B. Liu, Modeling Review Comments, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12. Association for Computational Linguistics, 2012, pp
[14] R. Y. K. Lau, S. Y. Liao, R. C.-W. Kwok, K. Xu, Y. Xia, and Y. Li, Text Mining and Probabilistic Language Modeling for Online Review Spam Detection, ACM Trans. Manage. Inf. Syst., vol. 2, no. 4, pp , Jan
[15] C. L. Lai, K. Q. Xu, R. Y. K. Lau, Y. Li, and D. Song, High-Order Concept Associations Mining and Inferential Language Modeling for Online Review Spam Detection, Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, Dec. 2010, pp
[16] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, Mining Writeprints From Anonymous E-mails for Forensic Investigation, Digit. Investig., vol. 7, no. 1-2, pp , Oct
[17] N. Jindal, B. Liu, and E.-P. Lim, Finding Unusual Review Patterns Using Unexpected Rules, Proceedings of the 19th ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2010, pp
[18] N. Jindal and B. Liu, Opinion Spam and Analysis, Proceedings of the International Conference on Web Search and Web Data Mining. New York, NY, USA: ACM, 2008, pp
[19] N. Jindal and B. Liu, Analyzing and Detecting Review Spam, in Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, Oct. 2007, pp
[20] A. Vrij, S. Mann, S. Kristen, and R. Fisher, Cues to Deception and Ability to Detect Lies as a Function of Police Interview Styles, Law and Human Behavior, vol. 31, no. 5, pp ,
[21] F. Li, M. Huang, Y. Yang, and X. Zhu, Learning to Identify Review Spam, Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. AAAI Press, 2011, pp
[22] C. Huang, Q. Jiang, and Y. Zhang, Detecting Comment Spam through Content Analysis, Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, vol. 6185, pp
[23] K.-H. Yoo and U. Gretzel, Comparison of Deceptive and Truthful Travel Reviews, in Information and Communication Technologies in Tourism Springer Vienna, 2009, pp
[24] J. K. Burgoon, J. P. Blair, T. Qin, and J. F. J. Nunamaker, Detecting Deception through Linguistic Analysis, Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2665, ch. Detecting, pp
[25] N. Jindal and B. Liu, Review Spam Detection, Proceedings of the 16th International Conference on World Wide Web. Banff, Alberta, Canada: ACM, Oct. 2007, pp
[26] B. Liu and L. Zhang, A Survey of Opinion Mining and Sentiment Analysis, in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. Boston, MA: Springer US, 2012, pp
[27] Y. Lin, H. Wu, and J. Zhang, Towards Online Anti-Opinion Spam: Spotting Fake Reviews from the Review Sequence, Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014 IEEE/ACM International Conference on, pp
[28] Q. Wang, B. Liang, W. Shi, Z. Liang, and W. Sun, Detecting Spam Comments with Malicious Users Behavioral Characteristics, Information Theory and Information Security (ICITIS), 2010 IEEE International Conference on, Dec. 2010, pp
