Analyzing and Detecting Review Spam

Seventh IEEE International Conference on Data Mining

Nitin Jindal and Bing Liu
Department of Computer Science
University of Illinois at Chicago

Abstract

Mining of opinions from product reviews, forum posts and blogs is an important research topic with many applications. However, existing research has focused on extraction, classification and summarization of opinions from these sources. An important issue that has not been studied so far is opinion spam, or the trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews. To our knowledge, there is still no published study on this topic, although Web page spam and email spam have been investigated extensively. We will see that review spam is quite different from Web page spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that review spam is widespread. In this paper, we first present a categorization of spam reviews and then propose several techniques to detect them.

1. Introduction

The Web has dramatically changed the way that people express themselves and interact with others. They can now post reviews of products at merchant sites (e.g., amazon.com) and express their views in blogs and forums. Such content contributed by Web users is called user-generated content. It is now well recognized that user-generated content contains valuable information that can be exploited for many applications. In this paper, we focus only on product reviews. In particular, we investigate review spam.

It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. The reviews are then used by potential customers to find opinions of existing users before purchasing the products.
They are also used by manufacturers to identify problems in their products and/or to find competitive intelligence information about their competitors [3, 6, 11]. In the past few years, there has been growing interest in mining opinions expressed in reviews because of its many practical applications.

Reviews are useful to both individual consumers and product manufacturers. For example, if one wants to buy a product, one typically goes to a merchant site (e.g., amazon.com) to read some reviews by existing users of the product. If the reviews are mostly positive, one is very likely to buy the product. If the reviews are mostly negative, one will most likely choose another product. Positive opinions can result in significant financial gains or fame for organizations and individuals. This gives good incentives for review spam. Existing work has focused on extracting and summarizing opinions in reviews [3, 6]. Little is known about the trustworthiness of reviews or the detection of spam.

Review spam is similar to Web page spam. In the context of search, due to the economic and/or publicity value of the rank position of a Web page returned by a search engine, Web page spam is widespread [1, 13]. Web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [5, 9]. In the context of reviews, the problem is similar, but also quite different (see Section 2).

In this paper, we study review spam based on 5.8 million reviews and 2.14 million reviewers (members who wrote at least one review) from amazon.com. We discovered that spam activities are widespread. For example, we found a large number of duplicate and near-duplicate reviews written by the same reviewers on different products, or by different reviewers (possibly different userids of the same persons) on the same products or different products.
This paper makes the following two main contributions:

(1) Review spam categorization: It presents a categorization of review spam. We found three main types of spam reviews. To our knowledge, this is the first report of such a categorization. It will form the basis for future research on review spam.

(2) Review spam analysis and detection: It proposes some novel techniques to study review spam and spam detection. In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. However, due to the specific nature of the different types of spam, we need to deal with them differently. For two types of spam reviews, we can detect them based on supervised learning because these two types of reviews are recognizable manually and thus training data can be labeled manually. However, for the remaining type of spam reviews, which we call false opinions, manual labeling by simply reading reviews is very hard, if not impossible, because a spammer can carefully craft a spam review that reads just like any innocent review in order to promote a target product or to damage the reputation of another product. We then discuss a novel way to study this problem using some duplicate reviews which are almost certainly spam.

2. Related Work

Although mining opinions (positive and negative) from reviews has become a popular research topic in recent years [3, 6], there is still no reported study on review spam. Here we only discuss some existing research on other types of spam.

Perhaps the most extensively studied topic on spam is Web spam. Web spam can be categorized into two main types: content spam and link spam. Link spam is spam on hyperlinks, which does not exist in reviews as there are usually no links among reviews. Content spam tries to add irrelevant or remotely relevant words to target pages to fool search engines into ranking the target pages high. A taxonomy of Web spam is given in [5]. Many researchers have studied this problem [e.g., 1, 5, 13]. Review spam is very different. Adding irrelevant words has little effect. Instead, spammers write undeserving positive reviews to promote some objects and/or malicious negative reviews to damage the reputation of some other objects. These false opinion spam reviews are very hard to detect.

Another related research area is email spam [4, 12], which is also quite different from review spam. Email spam usually refers to unsolicited commercial advertisements.
Although they exist, advertisements in reviews are not as frequent as in emails. They are also relatively easy to detect (see Section 4.2.3).

Recent studies on spam have also extended to recommender systems [8]. Although the objectives of spam on recommender systems are similar to those of review spam, their basic ideas are different. In recommender systems, a spammer injects some attack profiles into the system in order to get some products more (or less) frequently recommended. A profile is a set of ratings (e.g., 1-5) for a series of products. The spammer usually does not see other users' rating profiles and thus has to make guesses. In the context of product reviews, a reviewer sees all reviews for every product. The rating is only one part of a review; the other main part is the review text. [14] studies the utility of reviews using natural language features. Spam is a much broader concept involving all types of objectionable activities.

3. Categorization of Spam Reviews

We now present the review spam categorization, which is compiled based on extensive analysis of customer reviews from amazon.com.

Review Data from Amazon.com

In this work, we use reviews from amazon.com. The reason for using this data set is that it is large, covers a very wide range of products and has a relatively long history. It is a reasonably representative review data set. The reviews were crawled in June. We were able to extract 5.8 million reviews, 2.14 million reviewers and 6.7 million products (the exact number of products offered by amazon.com could be much larger since amazon.com only displays a maximum of 9600 products for each sub-category). Each amazon.com review consists of 8 fields:

<Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Review Body> <Number of Helpful Feedbacks> <Number of Feedbacks>

We used 4 main categories of products in our study, i.e., Books, Music, DVD and mproducts (manufactured products such as electronics, computers, etc.).
The number of reviews, reviewed products and reviewers in each category in our study is given in Table 1.

Table 1. Number of reviews, reviewed products and reviewers

Category  | Reviews | Reviewed Products | Reviewers
All       |         |                   |
Books     |         |                   |
DVD       |         |                   |
Music     |         |                   |
mproducts |         |                   |

Categorization of Review Spam

There are three main types of spam reviews.

Type 1 (False Opinions): Such reviews contain false opinions on products and are thus very harmful.
  a. Positive spam review: Such a review expresses an undeserving positive opinion on a product with the agenda of promoting the product.
  b. Negative spam review: Such a review expresses a malicious negative opinion on a product with the intention of damaging its reputation.

Type 2 (Reviews on brands only): Such reviews do not comment on the product itself but only express opinions on the brand (or manufacturer or seller), e.g., "I don't trust HP, and never bought anything from them." Although this review expresses an opinion, it is not on the specific product and can often be highly biased.

Type 3 (Non-reviews): Such reviews contain no opinions, and thus do not serve the purpose of reviews. Although they may not affect human users who read them, as they can be recognized easily, they affect automated opinion mining systems that aggregate review ratings, because these reviews also contain ratings which may just be randomly assigned. There are two main sub-categories.

Advertisements: In such reviews, reviewers list a set of product features or accessories. Although they may not contain any false information, they are considered spam as they contain no opinions. There are three main kinds of advertisements:
  a. Same product: The review describes some features or use of the product, e.g., "Detailed Product Specs: Standards * g, b, INMPR Compliant, TCP/IP, UPnP AV 1.0, USB 2.0, ...", which simply lists all product features.
  b. Different product: The review promotes some competing products from the same or a different brand. This is similar to the above case, but advertises a different product.
  c. Different seller: The review promotes a different seller or Web site for the product, e.g., "This is a great product but can be bought for less at: compuplus.com", which advertises a competing site selling the same product.

Other non-reviews: The rest of the non-reviews also consist of several types:
  a. Question or answer: The reviewer asks or answers questions or doubts about the product from fellow reviewers, e.g., "What port it is for? AGP or PCI Express?? From the looks of the picture it seems like the PCI Express version. Can anyone confirm this?", which asks a question about a graphics card.
  b. Comment: The review comments on some other reviews, e.g., "This Other Review is too funny."
  c. Random text: The review just contains some random text completely unrelated to the product, e.g., "Go Eagles Go", which was posted for Adobe Acrobat and is unrelated to the product.

4. Spam Detection

In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. Machine learning models may be built to classify each review as spam or non-spam, or to estimate the probability that each review is spam. To build a classification model, we need labeled training examples of both spam reviews and non-spam reviews. However, of the three types of spam, we can only manually label training examples for type 2 and type 3, as they are recognizable based on their contents. Recognizing whether a review is false opinion spam (type 1), however, is extremely difficult by reading the review alone, because one can carefully craft a spam review which is just like any innocent review. We tried to read a large number of reviews and were unable to reliably identify type 1 spam reviews manually. Thus, other means have to be explored in order to find training examples for detecting possible type 1 spam reviews.

Interestingly, in our analysis, we found a large number of duplicate and near-duplicate reviews. Our manual inspection of such reviews shows that they definitely contain type 2 and type 3 spam reviews. We are also sure that they contain type 1 spam reviews because of the following types of (near-) duplicates:
  1. Duplicates from different userids on the same product.
  2. Duplicates from the same userid on different products.
  3. Duplicates from different userids on different products.
Most of such reviews (excluding type 2 and type 3 spam) are almost certainly false opinion spam (type 1).
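The three kinds of (near-) duplicates listed above can be told apart from the userid and product id of the two reviews alone. The following is a minimal sketch, not the authors' implementation: the `Review` tuple and the function name `duplicate_type` are hypothetical, and the sketch assumes that the pair has already been judged a (near-) duplicate by some content comparison.

```python
from collections import namedtuple

# Hypothetical, reduced review record: only the fields needed here.
Review = namedtuple("Review", ["product_id", "reviewer_id", "body"])


def duplicate_type(r1, r2):
    """Map a pair of (near-)duplicate reviews to duplicate type 1, 2 or 3.

    Returns None for the remaining case (same userid, same product),
    which is not one of the three suspicious kinds listed in the text.
    """
    same_user = r1.reviewer_id == r2.reviewer_id
    same_product = r1.product_id == r2.product_id
    if not same_user and same_product:
        return 1  # different userids on the same product
    if same_user and not same_product:
        return 2  # same userid on different products
    if not same_user and not same_product:
        return 3  # different userids on different products
    return None


a = Review("P1", "U1", "Great product, works well.")
b = Review("P1", "U2", "Great product, works well.")
c = Review("P2", "U1", "Great product, works well.")
print(duplicate_type(a, b), duplicate_type(a, c), duplicate_type(b, c))
```

The fourth combination, the same userid on the same product, is deliberately left out of the three types.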
Our spam detection strategy: (1) detect duplicates and near-duplicates, (2) detect spam reviews of type 2 and type 3 based on supervised learning using manually labeled training examples, and (3) detect type 1 spam by exploiting the three types of duplicates above and other relevant information.

Detection of Duplicate Reviews

Duplicates and near-duplicates can be detected using the shingle method in [2]. In this work, we use 2-gram based review content comparison. Review pairs with a similarity score of at least 90% were chosen as duplicates. Fig. 1 plots the log of the number of review pairs against the similarity scores for four different product sub-categories, each belonging to one of the four major categories: books, music, DVDs and mproducts. The sub-categories are word literature, progressive music (65682 reviews), drama, and office electronic products (22020 reviews). All the sub-categories behave similarly. We also compared the reviews of other sub-categories. The behaviors are about the same. Due to space limitations, we are unable to show all of them.
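A 2-gram comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether the 2-grams are over words or characters, nor which similarity measure is used, so the sketch assumes word-level 2-grams and the Jaccard resemblance of the two 2-gram sets. The function names are hypothetical; only the 0.9 threshold comes from the text.

```python
def word_bigrams(text):
    """Return the set of word 2-grams (shingles of size 2) in a text."""
    words = text.lower().split()
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}


def bigram_similarity(text_a, text_b):
    """Jaccard resemblance of the two 2-gram sets, a value in [0, 1]."""
    set_a, set_b = word_bigrams(text_a), word_bigrams(text_b)
    if not set_a and not set_b:
        # Texts too short to contain any 2-gram: fall back to equality.
        return 1.0 if text_a.strip().lower() == text_b.strip().lower() else 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_duplicate(text_a, text_b, threshold=0.9):
    """Treat a pair as (near-)duplicate at similarity >= threshold."""
    return bigram_similarity(text_a, text_b) >= threshold


r1 = "this camera takes great pictures and the battery lasts long"
r2 = "this camera takes great pictures and the battery lasts forever"
print(bigram_similarity(r1, r1), bigram_similarity(r1, r2))
```

Comparing every pair of reviews this way is quadratic in the number of reviews; for a corpus of millions of reviews, some form of sketching or indexing of the shingle sets would be needed in practice.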

[Figure 1: y-axis Num Pairs (log); x-axis Similarity Score; series: Office Electronics, Drama DVDs, Word Literature Books, Progressive Music]

Fig. 1. Similarity scores and number of pairs of reviews from different sub-categories. Points on the X-axis are intervals. For example, 0.5 means the interval [0.5, 0.6).

Fig. 1 shows that the number of pairs decreases as the similarity score increases. It rises after the similarity score of 0.5 and 0.6. The rise is likely due to cases in which people copied their reviews on one product to another or to the same product.

Further study shows that about 10% of the reviewers with more than one review have duplicate reviews. In 40% of these cases, the reviews were written on the same day and were exact duplicates. In 30% of the cases, the reviews were written on the same day but had some attributes that were different.

Note that in many cases, if a person has more than one review on a particular product, these reviews are mostly exact duplicates. However, we do not regard them as spam, as they could be due to clicking the submit button more than once. We checked the amazon.com site and found that this was indeed possible. Some others were due to correction of mistakes in previous submissions.

For spam removal, we can delete all duplicate reviews which belong to any one of the three types described above. For other kinds of duplicates, we may want to keep only the last copy and remove the rest. Table 2 shows the numbers of likely spam reviews in the above three categories. The first number in column 2 of each row is the number of such reviews in the whole review database. The second number, within parentheses, is the number of such cases in the category mproducts. In the following study, we focus only on reviews in the category of mproducts. Reviews in other categories can be studied similarly. Note that when the same person writes the same review for different versions of the same product (e.g., hardcover and paperback editions of the same book), it may not be spam.
Out of the total of 4488 reviews, about 30% are from reviewers on more than one product. We manually checked the products which had exactly the same reviews. We found that these products have at least one feature different, e.g., two televisions with different dimensions. We labeled them as the same or different products based on the significance of the features that are different. Only a small percentage of products were labeled as the same, and many duplicate reviews on these products were also suspicious. Thus we consider all such duplicates as spam.

Table 2. Three types of duplicate spam reviews on all products and on category mproducts

   | Spam Review Type                        | Num Reviews (mproducts)
1  | Different userids on the same product   | 3067 (104)
2  | Same userid on different products       | (4270)
3  | Different userids on different products | 1383 (114)
   | Total                                   | (4488)

Detecting Type 2 & Type 3 Spam Reviews

As we mentioned in Section 3, type 2 and type 3 spam reviews are recognizable manually. Thus, we use supervised learning to detect them. We manually labeled 470 spam reviews of the two types. The breakdown is given in column 2 of Table 3. We did not label more, as the proportion of such reviews is extremely small and manual labeling is very time-consuming. Based on this set of labeled examples, we are already able to achieve very good classification results (Section 4.2.3).

Model Building Using Logistic Regression

For model building, we used logistic regression. The reason for using logistic regression is that it produces a probability estimate that each review is a spam review, which is highly desirable. In practice, the probabilistic output can be used to weight each review. Since the probability reflects the likelihood that a review is spam, reviews with high probabilities can be weighted down to reduce their effect on opinion mining; thus, there is no need to remove any review as spam. We used the statistical package R to perform logistic regression.
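The idea of weighting reviews down by their spam probability, rather than removing them, can be illustrated with an aggregate rating. The sketch below is an assumption-laden example and not the authors' method: the text does not specify a weighting scheme, so the weight `1 - p_spam` and the function name `spam_weighted_rating` are chosen here purely for illustration.

```python
def spam_weighted_rating(reviews):
    """Average rating where each review counts with weight (1 - p_spam).

    `reviews` is an iterable of (rating, p_spam) pairs, where p_spam is
    the classifier's estimated probability that the review is spam.
    Returns None if every review has weight zero.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for rating, p_spam in reviews:
        weight = 1.0 - p_spam  # assumed scheme: linear down-weighting
        weighted_sum += weight * rating
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# A likely-spam 1-star review barely moves the aggregate.
print(spam_weighted_rating([(5, 0.0), (4, 0.0), (1, 0.95)]))
```

With this scheme a review judged certainly spam contributes nothing, while a review with spam probability 0.5 counts half as much as a clearly genuine one.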
The AUC (Area under ROC Curve), a standard measure used in machine learning to assess model quality, is employed to evaluate the classification results. Apart from logistic regression, we also tried SVM, decision tree, and naïve Bayesian classification, but they gave poorer results and are thus not included. Below, we describe the features used in learning.

Feature Identification and Construction

There are three main types of information related to a review: (1) the content of the review, (2) the reviewer who wrote the review, and (3) the product being reviewed. We thus have three types of features:
(1) review centric features: characteristics of reviews.
(2) reviewer centric features: characteristics of reviewers.
(3) product centric features: characteristics of products.
For some features, we need to divide products and reviews into three types based on their average ratings (rating scale: 1-5): good (rating ≥ 4), bad (rating ≤ 2.5) and average otherwise.

Review Centric Features

1. Number of feedbacks (F1), number of helpful feedbacks (F2) and percent of helpful feedbacks (F3) that the review gets. Intuitively, feedbacks are useful in judging the review quality.
2. Length of review title (F4) and length of review body (F5). These features were chosen since longer reviews tend to get more user attention, so a spammer might use this to his/her advantage.
3. Position of the review in the reviews of a product sorted by date, in ascending (F6) and descending (F7) order. These features were chosen since earlier reviews tend to have more impact on the sale of a product and thus may be exploited by spammers. We also use binary features to indicate if a review is the first review (F8) or the only review (F9).
4. Textual features:
  a. Percent of positive (F10) and negative (F11) opinion-bearing words in the review, e.g., beautiful, bad and poor. We obtained the list of words from the authors of [6]. We also added a set of other words of our own.
  b. Cosine similarity (F12) of the review and product features (which are obtained from the product description page at amazon.com).
  c. Percent of times the brand name (F13) is mentioned in the review. This feature was used for reviews which praise or criticize the brand.
  d. Percent of numerals (F14), capitals (F15) and all-capital (F16) words. Excessive use of numerals signifies too much technical detail (thus non-reviews). Capitals and all capitals signify poorly written and unrelated reviews.
5. Rating related features:
  a. Rating (F17) of the review and its deviation (F18) from the average product rating. Feature (F19) indicating if the review is good, average or bad.
  b. Binary features indicating whether a bad review was written just after the first good review of the product and vice versa (F20, F21).

Reviewer Centric Features

1. Ratio of the number of reviews that the reviewer wrote which were the first reviews (F22) of the products to the total number of reviews that he/she wrote, and ratio of the number of cases in which he/she was the only reviewer (F23).
2. Rating related features: Average rating given by the reviewer (F24), standard deviation in rating (F25) and a feature indicating if the reviewer always gave only good, average or bad ratings (F26).
3. Binary features indicating whether the reviewer gave more than one type of rating, i.e., good, average and bad. There are four cases: a reviewer gave both good and bad ratings (F27), good and average ratings (F28), bad and average ratings (F29), and all three ratings (F30).
4. Percent of times that the reviewer wrote a review with binary features F20 (F31) and F21 (F32).

Product Centric Features

1. Price (F33) of the product.
2. Sales rank (F34) of the product. These features are used since spam may be focused on cheap/expensive or less-selling products.
3. Average rating (F35) and standard deviation in ratings (F36) of the reviews on the product.

Experimental Results

We ran logistic regression on the data using the 470 spam reviews as the positive class and the rest of the reviews as the negative class. Spam reviews discussed in Section 4.1 are not used since they are duplicates. The average AUC values based on 10-fold cross validation are given in Table 3.

Table 3. AUC values for different types of spam

Spam Type   | Num reviews | AUC   | AUC (text features only) | AUC (w/o feedbacks)
Types 2 & 3 | 470         | 98.7% | 90%                      | 98%
Type 2 only |             |       | 88%                      | 98%
Type 3 only |             |       | 92%                      | 98%

From the table, we observe that the AUC value for all spam types is 98.7%.
Using only textual features does not perform as well. Without using feedback features, the same results can be achieved. This is important because feedbacks can be spammed too.

5. Type 1 Spam Reviews

Section 4 allows us to conclude that type 2 and type 3 spam reviews are fairly easy to detect. Detecting type 1 spam reviews is, however, very difficult. As we mentioned earlier, it is almost impossible to recognize type 1 spam manually. Thus, we do not have manually labeled training data for learning.

In order to investigate type 1 spam reviews, let us first analyze what kinds of reviews are harmful and are likely to be spammed. Recall that a type 1 spammer aims (1) to promote some target objects by writing undeserving positive reviews on them, and/or (2) to damage the reputation of some other objects by writing malicious negative reviews on them. To achieve these two objectives, a spam review's rating needs to deviate from the average product rating (outlier reviews). For example, a spam review should give a negative rating to a good product. Clearly, a spam review which gives a positive rating to a good product is not very harmful. Thus, spam detection should focus on outlier reviews.

Making use of duplicates: Since we have no manually labeled examples to build a spam detection model for type 1 spam, we have to look to other sources. A natural choice is the three types of duplicates discussed in Section 4.1, which are almost certainly spam reviews. That is, we use these duplicates as positive examples (spam reviews) and the rest of the reviews as negative examples (non-spam reviews). We still use logistic regression for model building, based on the same set of features as described in Section 4 (no feature overfits duplicates). 10-fold cross validation gave us an AUC of 78% for duplicate reviews, which is quite high considering that non-duplicate reviews also contain spam. Using this model, we tried to classify many types of interesting reviews and found:
  1. The model built using duplicates is able to predict several types of outlier reviews (harmful spam reviews are outlier reviews, but not vice versa).
  2. User feedback on reviews is not effective in filtering out spam.
  3. Many top-ranked reviewers may have written spam reviews.
  4. Products with only a single review are very likely to be spammed.
Due to space limitations, we are unable to provide the detailed analysis and results, which will appear in a future publication.

6. Conclusions

This paper studied review spam and spam detection (apart from our earlier poster [7]). Three main types of spam were identified. Detection of such spam is done first by detecting duplicate reviews. We then detected type 2 and type 3 spam reviews by using supervised learning with manually labeled training examples. Results showed that the logistic regression model is highly effective. However, for type 1 spam reviews the story is quite different, because it is very hard to manually label training examples for type 1 spam. We presented an approach that uses three kinds of duplicates, which are very likely to be spam, as positive training examples to build a classification model. The results are promising.

The current study only represents an initial investigation of review spam. Much work remains to be done. In our future work, we will further improve the detection methods, and also look into spam in other kinds of media, e.g., forums and blogs.

7. Acknowledgement

This research was funded by Microsoft Corporation. We thank Ling Bo for many useful discussions.

8. References

[1] R. Baeza-Yates, C. Castillo & V. Lopez. PageRank increase under different collusion topologies. AIRWeb 05.
[2] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences.
[3] K. Dave, S. Lawrence & D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. WWW.
[4] I. Fette, N. Sadeh-Koniecpol & A. Tomasic. Learning to Detect Phishing Emails. WWW.
[5] Z. Gyongyi & H. Garcia-Molina. Web Spam Taxonomy. Tech. Report, Stanford University.
[6] M. Hu & B. Liu. Mining and summarizing customer reviews. KDD.
[7] N. Jindal & B. Liu. Review Spam Detection. WWW (poster paper).
[8] B. Mobasher, R. Burke & J. J. Sandvig. Model-based collaborative filtering as a defense against profile injection attacks. AAAI'2006.
[9] A. Ntoulas, M. Najork, M. Manasse & D. Fetterly. Detecting Spam Web Pages through Content Analysis. WWW.
[10] B. Pang, L. Lee & S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP.
[11] A-M. Popescu & O. Etzioni. Extracting Product Features and Opinions from Reviews. EMNLP.
[12] M. Sahami, S. Dumais, D. Heckerman & E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. AAAI Tech. Report WS-98-05.
[13] B. Wu, V. Goel & B. D. Davison. Topical TrustRank: using topicality to combat Web spam. WWW'2006.
[14] Z. Zhang & B. Varadarajan. Utility scoring of product reviews. CIKM.

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda

More information

Method to Study and Analyze Fraud Ranking In Mobile Apps

Method to Study and Analyze Fraud Ranking In Mobile Apps Method to Study and Analyze Fraud Ranking In Mobile Apps Ms. Priyanka R. Patil M.Tech student Marri Laxman Reddy Institute of Technology & Management Hyderabad. Abstract: Ranking fraud in the mobile App

More information

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008 Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Detecting Spam Web Pages

Detecting Spam Web Pages Detecting Spam Web Pages Marc Najork Microsoft Research Silicon Valley About me 1989-1993: UIUC (home of NCSA Mosaic) 1993-2001: Digital Equipment/Compaq Started working on web search in 1997 Mercator

More information

Part I: Data Mining Foundations
