Analyzing and Detecting Review Spam


Seventh IEEE International Conference on Data Mining

Nitin Jindal and Bing Liu
Department of Computer Science
University of Illinois at Chicago

Abstract

Mining of opinions from product reviews, forum posts and blogs is an important research topic with many applications. However, existing research has focused on extraction, classification and summarization of opinions from these sources. An important issue that has not been studied so far is opinion spam, or the trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews. To our knowledge, there is still no published study on this topic, although Web page spam and email spam have been investigated extensively. We will see that review spam is quite different from Web page spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that review spam is widespread. In this paper, we first present a categorization of spam reviews and then propose several techniques to detect them.

1. Introduction

The Web has dramatically changed the way that people express themselves and interact with others. They can now post reviews of products at merchant sites (e.g., amazon.com) and express their views in blogs and forums. Such content contributed by Web users is called user-generated content. It is now well recognized that user-generated content contains valuable information that can be exploited for many applications. In this paper, we focus only on product reviews. In particular, we investigate review spam.

It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. The reviews are then used by potential customers to find opinions of existing users before purchasing the products.
They are also used by manufacturers to identify problems in their products and/or to find competitive intelligence information about their competitors [3, 6, 11].

In the past few years, there has been growing interest in mining opinions expressed in reviews due to many practical applications. Reviews are useful to both individual consumers and product manufacturers. For example, if one wants to buy a product, one typically goes to a merchant site (e.g., amazon.com) to read some reviews of existing users of the product. If the reviews are mostly positive, one is very likely to buy the product. If the reviews are mostly negative, one will most likely choose another product. Positive opinions can result in significant financial gains or fame for organizations and individuals. This gives good incentives for review spam.

Existing work has focused on extracting and summarizing opinions in reviews [3, 6]. Little is known about the trustworthiness of reviews or the detection of spam. Review spam is similar to Web page spam. In the context of search, due to the economic and/or publicity value of the rank position of a Web page returned by a search engine, Web page spam is widespread [1, 13]. Web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [5, 9]. In the context of reviews, the problem is similar, but also quite different (see Section 2).

In this paper, we study review spam based on 5.8 million reviews and 2.14 million reviewers (members who wrote at least one review) from amazon.com. We discovered that spam activities are widespread. For example, we found a large number of duplicate and near-duplicate reviews written by the same reviewers on different products or by different reviewers (possibly different userids of the same persons) on the same products or different products.
This paper makes the following two main contributions:

(1) Review spam categorization: It presents a categorization of review spam. We found three main types of spam reviews. To our knowledge, this is the first report of such a categorization. It will form the basis for future research on review spam.

(2) Review spam analysis and detection: It proposes some novel techniques to study review spam and spam detection. In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. However, due to the specific nature of different types of spam, we need to deal with them differently. Two types of spam reviews can be detected with supervised learning, because these two types are recognizable manually and thus training data can be labeled manually. However, for the remaining type, which we call false opinions, manual labeling by simply reading reviews is very hard, if not impossible, because a spammer can carefully craft a review that promotes a target product or damages the reputation of another product yet reads just like any innocent review. We then discuss a novel way to study this problem using some duplicate reviews which are almost certainly spam.

2. Related Work

Although mining opinions (positive and negative) from reviews has become a popular research topic in recent years [3, 6], there is still no reported study on review spam. Here we only discuss some existing research on other types of spam.

Perhaps the most extensively studied topic on spam is Web spam. Web spam can be categorized into two main types: content spam and link spam. Link spam is spam on hyperlinks, which does not exist in reviews as there is usually no link among reviews. Content spam tries to add irrelevant or remotely relevant words to target pages to fool search engines into ranking the target pages high. A taxonomy of Web spam is given in [5]. Many researchers have studied this problem [e.g., 1, 5, 13]. Review spam is very different. Adding irrelevant words has little effect. Instead, spammers write undeserving positive reviews to promote some objects and/or malicious negative reviews to damage the reputation of some other objects. These false opinion spam reviews are very hard to detect.

Another related area of research is email spam [4, 12], which is also quite different from review spam. Email spam usually refers to unsolicited commercial advertisements.
Although they exist, advertisements in reviews are not as frequent as in emails. They are also relatively easy to detect (see Section 4.2.3).

Recent studies on spam have also extended to recommender systems [8]. Although the objectives of spam on recommender systems are similar to those of review spam, their basic ideas are different. In recommender systems, a spammer injects some attack profiles into the system in order to get some products more (or less) frequently recommended. A profile is a set of ratings (e.g., 1-5) for a series of products. The spammer usually does not see other users' rating profiles and thus has to make guesses. In the context of product reviews, a reviewer sees all reviews for every product. The rating is only part of a review; another main part is the review text. [14] studies the utility of reviews using natural language features. Spam is a much broader concept involving all types of objectionable activities.

3. Categorization of Spam Reviews

We now present the review spam categorization, which is compiled based on extensive analysis of customer reviews from amazon.com.

3.1 Review Data from Amazon.com

In this work, we use reviews from amazon.com. The reason for using this data set is that it is large, covers a very wide range of products and has a relatively long history. It is a reasonably representative review data set. The reviews were crawled in June. We were able to extract 5.8 million reviews, 2.14 million reviewers and 6.7 million products (the exact number of products offered by amazon.com could be much larger since amazon.com only displays a maximum of 9600 products for each sub-category). Each amazon.com review consists of 8 fields: <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Review Body> <Number of Helpful Feedbacks> <Number of Feedbacks>. We used 4 main categories of products in our study, i.e., Books, Music, DVD and mproducts (manufactured products such as electronics, computers, etc).
The number of reviews, reviewed products and reviewers in each category in our study is given in Table 1.

Table 1. Number of reviews, reviewed products and reviewers
Columns: Category, Reviews, Reviewed Products, Reviewers
Rows: All, Books, DVD, Music, mproducts

3.2 Categorization of Review Spam

There are three main types of spam reviews.

Type 1 (False Opinions): Such reviews contain false opinions on products and are thus very harmful.
a. Positive spam review: Such a review expresses an undeserving positive opinion on a product with the agenda of promoting the product.
b. Negative spam review: Such a review expresses a malicious negative opinion on a product with the intention of damaging its reputation.

Type 2 (Reviews on brands only): Such reviews do not comment on the product itself but only express opinions on the brand (or manufacturer or seller), e.g., "I don't trust HP, and never bought anything from them". Although this review expresses an opinion, it is not on the specific product and can often be highly biased.

Type 3 (Non-reviews): Such reviews contain no opinions, and thus do not serve the purpose of reviews. Although they may not affect human users who read them, as they can be recognized easily, they affect automated opinion mining systems that aggregate review ratings, because these reviews also contain ratings which may just be randomly assigned. There are two main sub-categories.

Advertisements: In such reviews, reviewers list a set of product features or accessories. Although they may not contain any false information, they are considered spam as they contain no opinions. There are three main kinds of advertisements:
a. Same product: The review describes some features or use of the product, e.g., "Detailed Product Specs: Standards * g, b, INMPR Compliant, TCP/IP, UPnP AV 1.0, USB 2.0, ...", which simply lists all product features.
b. Different product: The review promotes some competing products from the same or a different brand. This is similar to the above case, but advertises a different product.
c. Different seller: The review promotes a different seller or Web site for the product, e.g., "This is a great product but can be bought for less at: compuplus.com", which advertises a competing site selling the same product.

Other non-reviews: The rest of the non-reviews also consist of several types:
a. Question or answer: The reviewer asks or answers questions or doubts about the product from fellow reviewers, e.g., "What port it is for? AGP or PCI Express?? From the looks of the picture it seems like the PCI Express version. Can anyone confirm this?", which asks a question about a graphics card.
b. Comment: The review comments on some other reviews, e.g., "This Other Review is too funny".
c. Random text: The review just contains some random text completely unrelated to the product, e.g., "Go Eagles Go", which was posted for adobe acrobat and is unrelated to the product.

4. Spam Detection

In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. Machine learning models may be built to classify each review as spam or non-spam, or to give the probability of each review being spam. To build a classification model, we need labeled training examples of both spam reviews and non-spam reviews. However, of the three types of spam, we can only manually label training examples for type 2 and type 3, as they are recognizable based on their contents. Recognizing whether a review is a false opinion spam (type 1), however, is extremely difficult by reading the review, because one can carefully craft a spam review which is just like any innocent review. We tried to read a large number of reviews and were unable to reliably identify type 1 spam reviews manually. Thus, other means have to be explored in order to find training examples for detecting possible type 1 spam reviews.

Interestingly, in our analysis, we found a large number of duplicate and near-duplicate reviews. Our manual inspection of such reviews shows that they definitely contain type 2 and type 3 spam reviews. We are also sure that they contain type 1 spam reviews because of the following types of (near-) duplicates:
1. Duplicates from different userids on the same product.
2. Duplicates from the same userid on different products.
3. Duplicates from different userids on different products.
Most of such reviews (excluding types 2 and 3 spam) are almost certainly false opinion spam (type 1).
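The three duplicate types listed above depend only on whether two matching reviews share a userid and a product. A minimal sketch of that case analysis follows; the `Review` record and the `duplicate_type` function are hypothetical names introduced for illustration, and the sketch assumes the pair has already been judged a (near-) duplicate by some other means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Review:
    # Hypothetical record; field names are illustrative, not from the paper.
    product_id: str
    reviewer_id: str
    body: str


def duplicate_type(a: Review, b: Review) -> Optional[int]:
    """Return 1, 2 or 3 for the three duplicate types, else None.

    Assumes the caller has already decided that a and b are
    (near-) duplicates of each other.
    """
    same_user = a.reviewer_id == b.reviewer_id
    same_product = a.product_id == b.product_id
    if not same_user and same_product:
        return 1  # different userids, same product
    if same_user and not same_product:
        return 2  # same userid, different products
    if not same_user and not same_product:
        return 3  # different userids, different products
    return None  # same userid, same product: outside the three types
```

A pair from the same userid on the same product falls outside the three listed types, so the sketch returns `None` for it instead of guessing a label.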
Our spam detection strategy: (1) detect duplicates and near-duplicates, (2) detect spam reviews of type 2 and type 3 based on supervised learning using manually labeled training examples, and (3) detect type 1 spam by exploiting the three types of duplicates above and other relevant information.

4.1 Detection of Duplicate Reviews

Duplicates and near-duplicates can be detected using the shingle method in [2]. In this work, we use 2-gram based review content comparison. Review pairs with a similarity score of at least 90% were chosen as duplicates. Fig. 1 plots the log of the number of review pairs against the similarity score for four different product sub-categories, each belonging to one of the four major categories: books, music, DVDs and mproducts. The sub-categories are word literature, progressive music (65682 reviews), drama, and office electronic products (22020 reviews). All the sub-categories behave similarly. We also compared the reviews of other sub-categories. The behaviors are about the same. Due to space limitations, we are unable to show all of them.
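The 2-gram comparison with a 90% threshold can be sketched as follows. The text does not say whether the 2-grams are over words or characters, nor which similarity formula is used, so this sketch assumes word-level 2-grams and the Jaccard resemblance of the two shingle sets; the function names are hypothetical.

```python
import re


def shingles(text: str, n: int = 2) -> set:
    """Set of word n-grams (assumption: word-level, lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard resemblance of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0  # texts too short to form any n-gram
    return len(sa & sb) / len(sa | sb)


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Apply the 90% similarity threshold stated in the text."""
    return similarity(a, b) >= threshold
```

Comparing every pair of reviews this way is quadratic in the number of reviews, so a large collection would need some form of candidate filtering first; that step is outside this sketch.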

Fig. 1. Similarity scores and number of pairs of reviews from different sub-categories (x-axis: similarity score; y-axis: number of pairs; series: Office Electronics, Drama DVDs, Word Literature Books, Progressive Music). Points on the X-axis are intervals. For example, 0.5 means the interval [0.5, 0.6).

Fig. 1 shows that the number of pairs decreases as the similarity score increases. It rises after the similarity scores of 0.5 and 0.6. The rise is likely due to cases in which people copied their reviews on one product to another or to the same product. Further study shows that about 10% of the reviewers with more than one review have duplicate reviews. In 40% of these cases, the reviews were written on the same day and were exact duplicates. In 30% of the cases, reviews were written on the same day but had some attributes that were different.

Note that in many cases, if a person has more than one review on a particular product, these reviews are mostly exact duplicates. However, we do not regard them as spam, as they could be due to clicking the submit button more than once. We checked the amazon.com site and found that this was indeed possible. Some others were also due to correction of mistakes in previous submissions.

For spam removal, we can delete all duplicate reviews which belong to any one of the three types described above. For other kinds of duplicates, we may want to keep only the last copy and remove the rest. Table 2 shows the numbers of likely spam reviews in the above three categories. The first number in column 2 of each row is the number of such reviews in the whole review database. The second number, within (), is the number of such cases in the category mproducts. In the following study, we focus only on reviews in the category mproducts. Reviews in other categories can be studied similarly.

Note that in some cases, the same review written by the same person for different versions of the same product (e.g., hardcover and paper cover of the same book) may not be spam.
Out of the total of 4488 reviews, about 30% of them are from reviewers on more than one product. We manually checked the products which had exactly the same reviews. We found that these products have at least one feature different, e.g., two televisions with different dimensions. We labeled them as the same or different products based on the significance of the features that are different. Only a small percentage of products were labeled as the same, and many duplicate reviews on these products were also suspicious. Thus we consider all such duplicates as spam.

Table 2. Three types of duplicate spam reviews on all products and on category mproducts

     Spam Review Type                            Num Reviews (mproducts)
1    Different userids on the same product       3067 (104)
2    Same userid on different products           (4270)
3    Different userids on different products     1383 (114)
     Total                                       (4488)

4.2 Detecting Type 2 & Type 3 Spam Reviews

As we mentioned in Section 3, type 2 and type 3 spam reviews are recognizable manually. Thus, we use supervised learning to detect them. We manually labeled 470 spam reviews of the two types. The breakdown is given in column 2 of Table 3 in Section 4.2.3. We did not label more, as the proportion of such reviews is extremely small and manual labeling is very time-consuming. Based on this set of labeled examples, we are already able to achieve very good classification results (Section 4.2.3).

4.2.1 Model Building Using Logistic Regression

For model building, we used logistic regression. The reason for using logistic regression is that it produces a probability estimate that each review is a spam review, which is highly desirable. In practice, the probabilistic output can be used to weight each review. Since the probability reflects the likelihood that a review is spam, reviews with high probabilities can be weighted down to reduce their effects on opinion mining; thus, there is no need to remove any review as spam. We used the statistical package R to perform logistic regression.
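The idea of weighting reviews down by their spam probability, rather than removing them, can be illustrated with an aggregate rating. The text does not specify a weighting formula; this sketch assumes the simplest choice, a weight of 1 - p for a review with estimated spam probability p, and `weighted_average_rating` is a hypothetical helper.

```python
def weighted_average_rating(ratings, spam_probs):
    """Average rating with each review weighted by (1 - spam probability).

    The (1 - p) weight is an assumption made for illustration; the
    probabilities would come from a fitted model such as logistic regression.
    """
    if len(ratings) != len(spam_probs):
        raise ValueError("ratings and spam_probs must have the same length")
    weights = [1.0 - p for p in spam_probs]
    total = sum(weights)
    if total == 0:
        raise ValueError("all reviews have spam probability 1")
    return sum(w * r for w, r in zip(weights, ratings)) / total
```

With this choice, a review judged certain to be spam contributes nothing to the average, while a review with probability 0 counts in full.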
The AUC (Area under ROC Curve), a standard measure used in machine learning to assess model quality, is employed to evaluate the classification results. Apart from logistic regression, we also tried SVM, decision tree, and naïve Bayesian classification, but they gave poorer results and are thus not included. Below, we describe the features used in learning.

4.2.2 Feature Identification and Construction

There are three main types of information related to a review: (1) the content of the review, (2) the reviewer who wrote the review, and (3) the product being reviewed. We thus have three types of features: (1) review centric features: characteristics of reviews; (2) reviewer centric features: characteristics of reviewers; (3) product centric features: characteristics of products. For some features, we need to divide products and reviews into three types based on their average ratings (rating scale: 1-5): good (rating ≥ 4), bad (rating ≤ 2.5) and average, otherwise.

Review Centric Features
1. Number of feedbacks (F1), number of helpful feedbacks (F2) and percent of helpful feedbacks (F3) that the review gets. Intuitively, feedbacks are useful in judging the review quality.
2. Length of review title (F4) and length of review body (F5). These features were chosen since longer reviews tend to get more user attention. So, a spammer might use this to his/her advantage.
3. Position of the review in the reviews of a product sorted by date, in ascending (F6) and descending (F7) order. These features were chosen since earlier reviews tend to have more impact on the sale of a product and thus may be exploited by spammers. We also use binary features to indicate if a review is the first review (F8) or the only review (F9).
4. Textual features:
a. Percent of positive (F10) and negative (F11) opinion-bearing words in the review, e.g., beautiful, bad and poor. We obtained the list of words from the authors of [6]. We also added a set of other words of our own.
b. Cosine similarity (F12) of the review and product features (which are obtained from the product description page at amazon.com).
c. Percent of times the brand name (F13) is mentioned in the review. This feature was used for reviews which praise or criticize the brand.
d. Percent of numerals (F14), capitals (F15) and all capital (F16) words. Excessive use of numerals signifies too much technical detail (thus non-reviews). Capitals and all capitals signify poorly written and unrelated reviews.
5. Rating related features:
a. Rating (F17) of the review and its deviation (F18) from the average product rating. Feature (F19) indicating if the review is good, average or bad.
b. Binary features indicating whether a bad review was written just after the first good review of the product and vice versa (F20, F21).

Reviewer Centric Features
1. Ratio of the number of reviews that the reviewer wrote which were the first reviews (F22) of the products to the total number of reviews that he/she wrote, and ratio of the number of cases in which he/she was the only reviewer (F23).
2. Rating related features: Average rating given by the reviewer (F24), standard deviation in rating (F25) and a feature indicating if the reviewer always gave only good, average or bad rating (F26).
3. Binary features indicating whether the reviewer gave more than one type of rating, i.e., good, average and bad. There are four cases: a reviewer gave both good and bad ratings (F27), good rating and average rating (F28), bad rating and average rating (F29) and all three ratings (F30).
4. Percent of times that the reviewer wrote a review with binary features F20 (F31) and F21 (F32).

Product Centric Features
1. Price (F33) of the product.
2. Sales rank (F34) of the product. These features are used since spam may be focused on cheap/expensive or less selling products.
3. Average rating (F35) and standard deviation in ratings (F36) of the reviews on the product.

4.2.3 Experimental Results

We ran logistic regression on the data using the 470 spam reviews as the positive class and the rest of the reviews as the negative class. Spam reviews discussed in Section 4.1 are not used since they are duplicates. The average AUC values based on 10-fold cross validation are given in Table 3.

Table 3. AUC values for different types of spam

Spam Type      Num reviews   AUC     AUC (text features only)   AUC (w/o feedbacks)
Types 2 & 3    470           98.7%   90%                        98%
Type 2 only                          88%                        98%
Type 3 only                          92%                        98%

From the table, we observe that the AUC value for all spam types is 98.7%.
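For readers unfamiliar with the measure, the AUC equals the probability that a randomly chosen spam review receives a higher score than a randomly chosen non-spam review, with ties counted as one half. A minimal sketch of that pairwise definition is given here; it is not the R procedure used in the experiments, and the function name `auc` is illustrative.

```python
def auc(labels, scores):
    """AUC via the pairwise (Mann-Whitney) definition.

    labels: 1 for spam (positive class), 0 for non-spam.
    scores: model outputs, higher meaning more likely spam.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0  # positive ranked above negative
            elif p == q:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic and is meant only to make the definition concrete; a rank-based computation would be preferable on a data set with many reviews.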
Using only textual features does not perform as well. Without using feedback features, nearly the same results can be achieved. This is important because feedbacks can be spammed too.

5. Type 1 Spam Reviews

Section 4 allows us to conclude that type 2 and type 3 spam reviews are fairly easy to detect. Detecting type 1 spam reviews is, however, very difficult. As we mentioned earlier, it is almost impossible to recognize type 1 spam manually. Thus, we do not have manually labeled training data for learning.

In order to investigate type 1 spam reviews, let us first analyze what kinds of reviews are harmful and are likely to be spammed. Recall that a type 1 spammer aims (1) to promote some target objects by writing undeserving positive reviews on them, and/or (2) to damage the reputation of some other objects by writing malicious negative reviews on them. To achieve the above two objectives, a spam review's rating needs to deviate from the average product rating (outlier reviews). For example, a spam review should give a negative rating to a good product. Clearly, a spam review which gives a positive rating to a good product is not very harmful. Thus, spam detection should focus on outlier reviews.

Making use of duplicates: Since we have no manually labeled examples with which to build a model for identifying type 1 spam, we have to look to other sources. A natural choice is the three types of duplicates discussed in Section 4.1, which are almost certainly spam reviews. That is, we use these duplicates as positive examples (spam reviews) and the rest of the reviews as negative examples (non-spam reviews). We still use logistic regression for model building, based on the same set of features as described in Section 4.2.2 (no feature overfits duplicates). 10-fold cross validation gave an AUC of 78% for duplicate reviews, which is quite high considering that non-duplicate reviews also contain spam. Using this model, we tried to classify many types of interesting reviews and found:
1. The model built using duplicates is able to predict several types of outlier reviews (harmful spam reviews are outlier reviews, but not vice versa).
2. User feedback on reviews is not effective in filtering out spam.
3. Many top-ranked reviewers may have written spam reviews.
4. Products with only a single review are very likely to be spammed.
Due to space limitations, we are unable to provide the detailed analysis and results, which will appear in a future publication.

6. Conclusions

This paper studied review spam and spam detection (apart from our earlier poster [7]). Three main types of spam were identified. Detection of such spam is done first by detecting duplicate reviews. We then detected type 2 and type 3 spam reviews by using supervised learning with manually labeled training examples. Results showed that the logistic regression model is highly effective. However, for type 1 spam reviews the story is quite different, because it is very hard to manually label training examples for type 1 spam. We presented an approach that uses three kinds of duplicates, which are very likely to be spam, as positive training examples to build a classification model. The results are promising. The current study only represents an initial investigation of review spam. Much work remains to be done. In our future work, we will further improve the detection methods, and also look into spam in other kinds of media, e.g., forums and blogs.

7. Acknowledgement

This research was funded by Microsoft Corporation. We thank Ling Bo for many useful discussions.

8. References

[1] R. Baeza-Yates, C. Castillo & V. Lopez. PageRank increase under different collusion topologies. AIRWeb'05.
[2] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences.
[3] K. Dave, S. Lawrence & D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. WWW.
[4] I. Fette, N. Sadeh-Koniecpol & A. Tomasic. Learning to Detect Phishing Emails. WWW.
[5] Z. Gyongyi & H. Garcia-Molina. Web Spam Taxonomy. Tech. Report, Stanford University.
[6] M. Hu & B. Liu. Mining and summarizing customer reviews. KDD.
[7] N. Jindal & B. Liu. Review Spam Detection. WWW (poster paper).
[8] B. Mobasher, R. Burke & J. J. Sandvig. Model-based collaborative filtering as a defense against profile injection attacks. AAAI'2006.
[9] A. Ntoulas, M. Najork, M. Manasse & D. Fetterly. Detecting Spam Web Pages through Content Analysis. WWW.
[10] B. Pang, L. Lee & S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP.
[11] A-M. Popescu & O. Etzioni. Extracting Product Features and Opinions from Reviews. EMNLP.
[12] M. Sahami, S. Dumais, D. Heckerman & E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. AAAI Tech. Report WS-98-05.
[13] B. Wu, V. Goel & B. D. Davison. Topical TrustRank: using topicality to combat Web spam. WWW'2006.
[14] Z. Zhang & B. Varadarajan. Utility scoring of product reviews. CIKM.
