Karami, A., & Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iConference 2015 Proceedings.

Similar documents
Method to Study and Analyze Fraud Ranking In Mobile Apps

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Opinion Spam and Analysis

Automatic Domain Partitioning for Multi-Domain Learning

Review Spam Analysis using Term-Frequencies

Detecting Opinion Spammer Groups through Community Discovery and Sentiment Analysis

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Framework for Fake Review Annotation

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Supervised classification of law area in the legal domain

Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

NLP Final Project Fall 2015, Due Friday, December 18

Detecting Opinion Spammer Groups and Spam Targets through Community Discovery and Sentiment Analysis

Building Corpus with Emoticons for Sentiment Analysis

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

A Content Vector Model for Text Classification

Best Customer Services among the E-Commerce Websites A Predictive Analysis

String Vector based KNN for Text Categorization

Fraud Detection of Mobile Apps

Feature Selection for an n-gram Approach to Web Page Genre Classification

Domain-specific user preference prediction based on multiple user activities

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Spam Classification Documentation

Identifying Fraudulently Promoted Online Videos

Sentiment Analysis on Twitter Data using KNN and SVM

Spotting Fake Reviews via Collective Positive-Unlabeled Learning

Search Engines. Information Retrieval in Practice

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Fake and Spam Messages: Detecting Misinformation During Natural Disasters on Social Media

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

WordNet-based User Profiles for Semantic Personalization

2. Design Methodology

Chapter 5: Summary and Conclusion. Chapter 1: Introduction

Classification. I don't like spam. Spam, Spam, Spam. Information Retrieval

Using PageRank in Feature Selection

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Naïve Bayes for text classification

Using Text Learning to help Web browsing

Automated Online News Classification with Personalization

Sentiment Analysis using Support Vector Machine based on Feature Selection and Semantic Analysis

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Query Difficulty Prediction for Contextual Image Retrieval

Information-Theoretic Feature Selection Algorithms for Text Classification

Sentiment Analysis for Amazon Reviews

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

A Framework for Web Host Quality Detection

Natural Language Processing on Hospitals: Sentimental Analysis and Feature Extraction

Sentiment analysis under temporal shift

Ontology-Based Web Query Classification for Research Paper Searching

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

Feature selection. LING 572 Fei Xia

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Micro-blogging Sentiment Analysis Using Bayesian Classification Methods

In this project, I examined methods to classify a corpus of e-mails by their content in order to suggest text blocks for semi-automatic replies.

Introduction. What is the World Wide Web? A Brief History of the Web and the Internet. Web Data Mining. What is Data Mining?

News-Oriented Keyword Indexing with Maximum Entropy Principle.

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Trends Manipulation and Spam Detection in Twitter

TEXT CATEGORIZATION PROBLEM

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

Sentiment Analysis of Customers using Product Feedback Data under Hadoop Framework

SPAM REVIEW DETECTION ON E-COMMERCE SITES

TRANSDUCTIVE TRANSFER LEARNING BASED ON KL-DIVERGENCE

Web-based Question Answering: Revisiting AskMSR

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

A Feature Selection Method to Handle Imbalanced Data in Text Classification

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

A hybrid method to categorize HTML documents

SHILLING ATTACK DETECTION IN RECOMMENDER SYSTEMS USING CLASSIFICATION TECHNIQUES

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Bug or Not? Bug Report Classification using N-Gram IDF

Part I: Data Mining Foundations

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

Mining the Query Logs of a Chinese Web Search Engine for Character Usage Analysis

Automatic Labeling of Issues on Github A Machine learning Approach

Tri-modal Human Body Segmentation

Categorization of Sequential Data using Associative Classifiers

Towards Building A Personalized Online Web Spam Detector in Intelligent Web Browsers

Automated Tagging for Online Q&A Forums

Mining Social Media Users Interest

Northeastern University in TREC 2009 Million Query Track

An Empirical Study of Lazy Multilabel Classification Algorithms

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Applying Supervised Learning

Domain-specific Concept-based Information Retrieval System

Keyword Extraction by KNN considering Similarity among Features

Query-based Text Normalization Selection Models for Enhanced Retrieval Accuracy

CLASSIFICATION FOR SCALING METHODS IN DATA MINING

Semi-Supervised Clustering with Partial Background Information

Transcription:

Online Review Spam Detection by New Linguistic Features

Amir Karami, University of Maryland, Baltimore County
Bin Zhou, University of Maryland, Baltimore County

Karami, A., & Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iConference 2015 Proceedings.

1 Introduction

The recent development of Web 2.0 has generated a wealth of user-created content. Among the various types of user-generated data on the web, reviews of businesses, products, and services are becoming more and more important due to the word-of-mouth effect and their influence on consumers' purchase decisions. However, with the increasing popularity of online review websites such as TripAdvisor [1] and Yelp [2], malicious users have started to abuse the convenience of publishing online reviews and deliberately post low-quality, untrustworthy, or even fraudulent reviews. Such spam reviews can produce significant financial gains for organizations and individuals while harming their competitors. For example, a few recent studies have reported a new category of business that hires people to write positive reviews for companies in order to attract users' attention and increase profits [3]. Spam reviews undoubtedly reduce the quality of reviews; they may even mislead users into making wrong purchase decisions. Therefore, there is great demand for detecting spam reviews thoroughly on the web.

Recently, several studies have investigated machine learning techniques to automatically construct spam classification models based on specific features (Ott, Choi, Cardie, & Hancock, 2011; Mihalcea & Strapparava, 2009). For example, Ott et al. (2011) investigated lexical features such as the frequency of verbs used in reviews; their results indicated that such lexical features are useful for building spam classification models for spam reviews. Despite the usefulness of many lexical features, the classification performance of existing models for spam review detection is still far from satisfactory. An interesting direction to explore is whether other features, such as users' sentiments and feelings, as well as other linguistic features of reviews, could be incorporated into the classification model.

In this paper, we focus on various types of linguistic features in user reviews: linguistic processes such as the number of pronouns, psychological features such as affective processes, current concerns such as degree of leisure, spoken features such as degree of assent, and punctuation such as the number of colons. We evaluated spam classification performance using more than 40 different classification algorithms on a spam review benchmark dataset. Our experimental results verified that combining linguistic features with others (e.g., the frequency of words) can improve detection performance over the state-of-the-art method, reaching more than 93% accuracy.

[1] http://tripadvisor.com
[2] http://www.yelp.com
[3] http://www.cnet.com/news/fake-reviews-prompt-belkin-apology/

The remainder of the paper is organized as follows: in Section 2, we briefly review relevant studies. In Section 3, we present in detail the set of features generated from the Linguistic Inquiry and Word Count tool. An empirical study was conducted to verify the effectiveness of our method, and its results are provided in Section 4. Finally, we present a summary, limitations, and future directions in Section 5.

2 Related Work

Spam detection spans different domains, including the Web (Castillo et al., 2006), email (Chirita, Diederich, & Nejdl, 2005), and SMS (Karami & Zhou, 2014a, 2014b). The problem of opinion spam was introduced by investigating supervised learning techniques to detect fake reviews (Jindal & Liu, 2008). Lim et al. (2010) tracked the behavior of review spammers and found specific behaviors, such as targeting particular products or product groups in order to maximize impact (Lim, Nguyen, Jindal, Liu, & Lauw, 2010). Machine learning is one of the popular approaches to online review spam detection. Examples include employing standard word and part-of-speech (POS) n-gram features for supervised learning (Ott et al., 2011), using a graph-based method to find fake store reviewers (Wang, Xie, Liu, & Yu, 2011), and using frequent pattern mining to find groups of reviewers who frequently write reviews together (Mukherjee, Liu, Wang, Glance, & Jindal, 2011; Mukherjee, Liu, & Glance, 2012).

3 Feature Encoding

A website with online reviews hosts a set of k online reviews R = {r_1, ..., r_k}. Each review consists of words, numbers, etc., and R can be about any topic. The online review spam detection problem is to predict whether a review r_i is a deceptive positive review, using a classifier c:

c : r_i → {deceptive, truthful}

For classification, we extract a set of n features F = {f_1, ..., f_n} from R. We outline two groups of features used to train classifiers for detecting deceptive opinion spam on the dataset.

3.1 Developed-LIWC Features

Linguistic Inquiry and Word Count (LIWC) [4] is a tool that analyzes 80 different features of text, covering linguistic processes such as the number of pronouns, psychological processes such as affective processes, current concerns such as degree of leisure, spoken features such as degree of assent, and punctuation such as the number of colons. Although LIWC is not itself a text classifier, each of its outputs can be treated as a feature. While all or some of the LIWC features have been used in other spam detection research (Ott et al., 2011; Zhou, Burgoon, Twitchell, Qin, & Nunamaker Jr, 2004; Karami & Zhou, 2014b), we also incorporated an additional 224 features, LIWC+, based on various combinations of the raw features collected from LIWC. For instance, we derived relative polarity from the difference between the positive and negative feelings scores. Table 1 shows a sample of our new features. To the best of our knowledge, these kinds of linguistic and sentiment features have not yet been explored for online review spam detection. We call our extension of LIWC "Developed-LIWC" (D-LIWC); it comprises 304 features, including both the LIWC and LIWC+ features.
- The ratio of the Verb score to the All Words score
- The ratio of the Parentheses score to the All Punctuation score
- The ratio of the Negative Feelings score to the overall Affective Feelings score
- The difference between the score of words related to Family and the score of words related to Humans
- The difference between the score of words related to Leisure and the score of words related to Money

Table 1: A Sample of LIWC+ Features

[4] http://www.liwc.net/
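To make the LIWC+ construction concrete, the following is a minimal Python sketch of how such derived ratio and difference features could be computed from raw LIWC output. This is an illustration under assumptions rather than the authors' code: the category names used below (verb, all_words, parentheses, all_punct, posemo, negemo, affect, family, humans, leisure, money) are hypothetical stand-ins for the actual LIWC output labels.

def derive_liwc_plus(s):
    """Derive LIWC+-style features from one review's raw LIWC scores.
    s: dict mapping a (hypothetical) LIWC category name to its score."""
    eps = 1e-9  # guard against division by zero
    return {
        # Ratios of one LIWC score to a broader LIWC score (cf. Table 1)
        "verb_to_all_words": s["verb"] / (s["all_words"] + eps),
        "parentheses_to_all_punct": s["parentheses"] / (s["all_punct"] + eps),
        "negemo_to_affect": s["negemo"] / (s["affect"] + eps),
        # Differences between related LIWC scores (cf. Table 1)
        "relative_polarity": s["posemo"] - s["negemo"],
        "family_minus_humans": s["family"] - s["humans"],
        "leisure_minus_money": s["leisure"] - s["money"],
    }

# Example with made-up scores for a single review:
scores = {"verb": 14.2, "all_words": 100.0, "parentheses": 0.5, "all_punct": 12.0,
          "posemo": 4.3, "negemo": 1.1, "affect": 5.4,
          "family": 0.2, "humans": 0.9, "leisure": 2.5, "money": 0.3}
print(derive_liwc_plus(scores))

Applying combinations of this kind across the 80 raw LIWC categories is one plausible way to arrive at the 224 LIWC+ features that, together with the raw scores, form the 304-dimensional D-LIWC representation described above.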

3.2 Unigram Features

In text categorization, one of the popular techniques for feature extraction is n-grams, including unigrams, bigrams, and trigrams. The second feature extraction strategy in this research is bag-of-words, i.e., unigrams, one of the most popular feature types in text categorization. With this strategy, we obtain the frequency of each word across the reviews. We will investigate bigrams and trigrams in future work.

4 Experimental Results

In this section, we present an empirical evaluation of our approach against the best result in the literature, using more than 40 classification algorithms ranging from Random Forest to Naive Bayes. In the experiments, we use Weka [5] for classification evaluation and leverage one available dataset [6]: a labeled corpus of 800 positive reviews, comprising 400 truthful positive reviews and 400 deceptive positive reviews. For all classification algorithms, we use 80% of the data for training and 20% for testing, with 5-fold cross-validation.

4.1 Classifier Performance

To determine whether a review is deceptive or truthful, we adopt supervised machine learning algorithms; the goal of classification is to assign each review to the deceptive or truthful class. To evaluate spam detection performance, we measure F-measure and accuracy. The features from the approaches in Section 3 are used to train classification algorithms in the Weka machine learning toolkit. Among the classification algorithms, the Support Vector Machine (SVM) shows the best performance; this technique finds a high-dimensional hyperplane separating the two groups of data. We use SVM to train models with different combinations of features, namely the D-LIWC, LIWC, and Unigram features. One contribution of this paper is the use of a feature selection approach to improve classifier performance and to reduce the number of features, avoiding the negative effects of high dimensionality when analyzing large numbers of online reviews. We chose Chi-Square as the feature selection method, as it is among the most effective methods (Yang & Pedersen, 1997). Table 2 presents the classification performance using the different sets of features from Section 3.

Feature Set        Number of Features        Accuracy   F-Measure     F-Measure
                                                        (Deceptive)   (Truthful)
LIWC               All Features (80)         76.8%      76.9          76.6
LIWC               Selected Features (60)    78.6%      79.1          78.1
D-LIWC             All Features (304)        78.72%     79.3          78.1
D-LIWC             Selected Features (125)   79.22%     79.8          78.6
Unigrams           All Features (4965)       88.4%      88.6          88.2
Unigrams           Selected Features (2000)  89.11%     89.4          88.8
Unigrams+LIWC      All Features (5044)       85.98%     86.5          85.4
Unigrams+LIWC      Selected Features (2000)  88.86%     89.2          88.5
Unigrams+D-LIWC    All Features (5268)       90.62%     91.3          89.8
Unigrams+D-LIWC    Selected Features (2500)  93.42%     93.8          93.1
Ott et al. (2011)  All Features (N/A)        89.8%      89.8          89.8

Table 2: Classifier Performance (F-measure reported separately for the deceptive and truthful classes)

We observe that the SVM classifier outperforms the result of Ott et al. (2011) using Bigrams and LIWC, the best performance previously reported in the literature (Table 2). In addition, we explore different numbers of selected features in order to handle the high-dimensionality, sparsity, and complexity problems that arise with large numbers of reviews.

[5] http://www.cs.waikato.ac.nz/ml/weka/
[6] http://myleott.com/op_spam/
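For readers who want to reproduce the general shape of this experiment, here is a rough scikit-learn sketch; the authors themselves used Weka, so this is a stand-in under assumptions, not their configuration. It builds unigram term-frequency features, selects the top-ranked features by Chi-Square, and evaluates a linear SVM with 5-fold cross-validation. The ten toy reviews and their labels are placeholders: in practice one would load the 800 labeled reviews from the corpus in footnote [6] and select the top 2,000 features as in Table 2 (k is reduced here only so the toy example runs).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus: label 1 = deceptive positive review, 0 = truthful.
# Replace with the 800 reviews from the op_spam corpus (footnote [6]).
reviews = [
    "the room was clean and the staff were genuinely helpful",
    "great location near the river with a quiet comfortable bed",
    "breakfast was average but the view made up for it",
    "check in was slow yet the concierge fixed our booking quickly",
    "the gym was small but the pool area was well maintained",
    "this is the best hotel i have ever ever stayed at amazing",
    "absolutely perfect in every way my family loved everything",
    "i will tell everyone i know to stay at this wonderful hotel",
    "the most luxurious experience of my life truly unbelievable",
    "everything was flawless the greatest vacation we ever had",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ("unigrams", CountVectorizer(ngram_range=(1, 1))),  # unigram term frequencies
    ("chi2", SelectKBest(chi2, k=10)),  # paper keeps the top 2000 on the full corpus
    ("svm", LinearSVC()),               # linear SVM, the best performer in Table 2
])

# 5-fold cross-validated accuracy, mirroring the paper's evaluation protocol
scores = cross_val_score(pipeline, reviews, labels, cv=5, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())

Note that Weka's SVM implementation (SMO), its Chi-Square ranking, and the paper's exact train/test protocol may all differ in detail from this sketch.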

In Table 2, two sets of features are reported for each method: the first uses all features in the category (e.g., all LIWC features), and the second reports the best performance over different sets of top selected features. For example, the accuracy with all 80 LIWC features is 76.8%, while the accuracy with the 60 features selected by Chi-Square rises to 78.6%. We also track the proportion of each feature group among the top n features, for n = 10 to n = 100. The results show that most of the top features belong to LIWC+ (Figure 1); for example, among the top 100 features, the proportions of LIWC+, Unigram, and LIWC features are 56%, 26%, and 18%, respectively.

[Figure 1: Feature proportions among the top n features (n = 10 to 100); LIWC+ features dominate throughout.]

5 Conclusion

The recent surge of Web 2.0 makes emerging communication media such as online review websites particularly attractive to malicious users. The challenge of detecting spam reviews on those websites stems mainly from the huge size of review data and the inherent difficulty of detection. Existing research on online review spam detection has focused on word statistics; surprisingly, little work has examined deeper semantic categories of text expressions. In this paper, we proposed employing categories of lexical semantic and linguistic features for detecting online spam reviews. Our experimental results showed that by incorporating many linguistic features of reviews, spam review detection performance can be greatly improved compared with state-of-the-art methods. There are several interesting directions to explore in the future, including applying our approach to negative opinions as well as to opinions from other domains, such as product reviews. In addition, we are interested in exploring latent semantic features, such as grouping words with similar meanings using topic modeling.

References

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., & Vigna, S. (2006). A reference collection for web spam. In ACM SIGIR Forum (Vol. 40, pp. 11–24).

Chirita, P.-A., Diederich, J., & Nejdl, W. (2005). MailRank: Using ranking for spam detection. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (pp. 373–380).

Jindal, N., & Liu, B. (2008). Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 219–230).

Karami, A., & Zhou, L. (2014a). Exploiting latent content based features for the detection of static SMS spams. In Proceedings of the 77th Annual Meeting of the Association for Information Science and Technology (ASIST).

Karami, A., & Zhou, L. (2014b). Improving static SMS spam detection by using new content-based features. In 20th Americas Conference on Information Systems (AMCIS).

Lim, E.-P., Nguyen, V.-A., Jindal, N., Liu, B., & Lauw, H. W. (2010). Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (pp. 939–948).

Mihalcea, R., & Strapparava, C. (2009). The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 309–312).

Mukherjee, A., Liu, B., & Glance, N. (2012). Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st International Conference on World Wide Web (pp. 191–200).

Mukherjee, A., Liu, B., Wang, J., Glance, N., & Jindal, N. (2011). Detecting group review spam. In Proceedings of the 20th International Conference Companion on World Wide Web (pp. 93–94).

Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 309–319).

Wang, G., Xie, S., Liu, B., & Yu, P. S. (2011). Review graph based online store review spammer detection. In 2011 IEEE 11th International Conference on Data Mining (ICDM) (pp. 1242–1247).

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML (Vol. 97, pp. 412–420).

Zhou, L., Burgoon, J. K., Twitchell, D. P., Qin, T., & Nunamaker Jr, J. F. (2004). A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems, 20(4), 139–166.