NeighborWatcher: A Content-Agnostic Comment Spam Inference System


Jialong Zhang and Guofei Gu
Secure Communication and Computer Systems Lab, Department of Computer Science & Engineering, Texas A&M University

Spamdexing Comment Spamming

Comment Spamming
Comment spamming refers to automatically and massively posting random or specific comments to benign third-party websites that allow user-generated content.
- Examples: blog, forum, guestbook
- Benefits: improved search rank, not easily blocked, scalability, low cost

Comment Spamming: Current Research
- Content-based detection (Y. Shin et al. [INFOCOM'12]): easy to evade
- Context-based detection (Y. Niu et al. [NDSS'07]): low coverage
- Honey blogs: passive, low coverage

Motivation
Comment spamming:
- What are the properties of the benign websites (harbors) on which spammers usually spam?
- What is the infrastructure of those harbors?
- Can we exploit the infrastructure of harbors to help detect spam?
Assumptions:
- Each spammer has a relatively stable set of spamming harbors.
- Spammers intend to post the spam URL massively and automatically on their spamming harbors.

Threat Model
Stages shown: Harbor Search, Harbor List, Harbor Testing, Harbor Confirm, Spamming

Data Collection
- Seeds: 10,000 spam links from previous work (J. Zhang et al. [RAID'12])
- Method: search the spam links in Google. Search results include:
  - Security websites that report the searched links as spam
  - Benign websites that link to the searched links
  - Spam harbors

Data Collection [figure; labels: Harbor, Spam]

Data Collection
- Seeds: 10,000 spam links from previous work (J. Zhang et al. [RAID'12])
- Method: search the spam links in Google. Search results include:
  - Security websites that report the searched links as spam
  - Benign websites that link to the searched links
  - Spam harbors
- Extract the search results whose content contains embedded hyperlink tags (e.g., [URL] [/URL]).

                              Blog      Forum  GuestBook      Other      Total
# of search results         27,846     29,860     31,926    500,717    590,349
# of harbors (domain)        4,807      2,515      3,878     27,713     38,913
# of active harbors          4,685      2,185      3,419     25,642     35,931
# of postings              532,413    640,073  1,469,251  6,497,263  9,139,000
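
The extraction step keeps search results whose content contains hyperlink tags such as [URL] [/URL]. A minimal sketch of such a filter is shown here. The regular expression, the function name `extract_bbcode_urls`, the `[URL=...]` variant, and the sample text are illustrative assumptions, not the authors' tooling.

```python
import re

# Hypothetical pattern: matches [URL]target[/URL] and [URL=target]label[/URL],
# case-insensitively. The slides only mention the [URL] [/URL] form.
BBCODE_URL = re.compile(
    r"\[url(?:=(?P<attr>[^\]]+))?\](?P<body>.*?)\[/url\]",
    re.IGNORECASE | re.DOTALL,
)


def extract_bbcode_urls(text: str) -> list[str]:
    """Return the link targets found in BBCode-style URL tags."""
    urls = []
    for match in BBCODE_URL.finditer(text):
        # Prefer the attribute form; fall back to the tag body.
        target = match.group("attr") or match.group("body")
        urls.append(target.strip().strip('"'))
    return urls


sample = (
    "Nice post! [URL]http://example.com/a[/URL] "
    "and [url=http://example.org]cheap[/url]"
)
links = extract_bbcode_urls(sample)
```

A page whose content yields a non-empty list would be kept as a candidate harbor posting under this sketch.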

Comment Spamming Study
Question: What are the properties of the benign websites (harbors) on which spammers usually spam?
Quality of harbors
- Intuition: the higher the quality of the spam harbors, the more effective comment spamming is.
- Measures: PageRank, lifetime, Google indexing interval

Quality of Harbors: PageRank
- Randomly choose 1,000 spam harbors in each category.
- Finding: spammers choose harbors regardless of their reputation, posting on many harbors to compensate for the relatively poor quality of individual ones.

Quality of Harbors: Lifetime
- Definition: the time interval between the posting time of the first spam and that of the most recent spam.
- Randomly choose 100 harbors in each category.
- Finding: spammers tend to explore stable harbors on which they can keep spamming.
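
The lifetime definition can be sketched in a few lines. The function name `harbor_lifetime_days`, the choice of days as the unit, and the sample dates are assumptions made for illustration; the slides do not state how the posting times were stored.

```python
from datetime import date


def harbor_lifetime_days(spam_post_dates: list[date]) -> int:
    """Days between the first and the most recent spam posting on one harbor."""
    if not spam_post_dates:
        raise ValueError("need at least one posting date")
    return (max(spam_post_dates) - min(spam_post_dates)).days


# Hypothetical posting dates for a single harbor (order does not matter).
posts = [date(2011, 3, 1), date(2012, 1, 15), date(2011, 11, 30)]
lifetime = harbor_lifetime_days(posts)
```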

Quality of Harbors: Google Indexing Interval
- Definition: the time difference between two consecutive Google crawl times (from the Google cache) of the same spam harbor.
- Randomly choose 100 harbors in each category.
- Finding: there is a long time lag between spamming time and indexing time.
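
Given a list of observed crawl times for one harbor, the indexing interval is the gap between neighbors in time order. The sketch below assumes the crawl times are already available as `datetime` values; the function name `indexing_intervals` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def indexing_intervals(crawl_times: list[datetime]) -> list[timedelta]:
    """Gaps between consecutive crawl times of one harbor, oldest first."""
    ordered = sorted(crawl_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical crawl times read from cached copies of one harbor page.
crawls = [datetime(2012, 5, 1), datetime(2012, 5, 20), datetime(2012, 5, 8)]
gaps = indexing_intervals(crawls)
```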

Comment Spamming Study
Question: What is the infrastructure of spam harbors?
Infrastructure of harbors
- Intuition: spammers build artificial relationships among spam harbors.
- Relation graph
- How spammers use their infrastructure to distribute spam

Infrastructure of Harbors: Relation Graph
- Figure labels: shared harbor, spamming community
- Spammers build artificial relationships among those harbors.
- Although different spammers may use different strategies to find their harbors, their sets of harbors intersect.
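
One way to picture the relation graph is as a weighted graph over harbors. The sketch below assumes that two harbors are related when the same spam link was posted on both, and that the edge weight counts the shared links. The slides do not define the edges, so this rule, the function name `build_relation_graph`, and the sample domains are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_relation_graph(postings):
    """Build a harbor relation graph from (harbor, spam_link) pairs.

    Returns {(harbor_a, harbor_b): shared_link_count} with harbor_a < harbor_b.
    """
    harbors_by_link = defaultdict(set)
    for harbor, link in postings:
        harbors_by_link[link].add(harbor)

    edges = defaultdict(int)
    for harbors in harbors_by_link.values():
        # Every pair of harbors carrying the same link gains one shared link.
        for a, b in combinations(sorted(harbors), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical postings: two spam links spread over three harbors.
postings = [
    ("blog.example", "spam1.test"),
    ("forum.example", "spam1.test"),
    ("guestbook.example", "spam1.test"),
    ("blog.example", "spam2.test"),
    ("forum.example", "spam2.test"),
]
graph = build_relation_graph(postings)
```

Under this assumption, harbors that repeatedly receive the same links end up with heavier edges, which is one plausible reading of a "spamming community".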

Infrastructure of Harbors: Distributing Spam
- Question: are spam messages distributed at the same time?
- Time Centrality Ratio: the maximal number of harbors that post the spam in the same month, divided by the total number of harbors that post this spam.
- Finding: spammers tend to use their spam infrastructure within a similar time window.
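
The Time Centrality Ratio definition translates directly into a short computation. The sketch assumes each harbor is counted once, by the date on which it posted the spam link; the function name `time_centrality_ratio` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def time_centrality_ratio(post_date_by_harbor: dict[str, date]) -> float:
    """Largest share of harbors that posted one spam link in the same month."""
    if not post_date_by_harbor:
        raise ValueError("need at least one harbor")
    per_month = Counter((d.year, d.month) for d in post_date_by_harbor.values())
    return max(per_month.values()) / len(post_date_by_harbor)


# Hypothetical posting dates of one spam link on four harbors.
posted = {
    "blog.example": date(2012, 4, 2),
    "forum.example": date(2012, 4, 17),
    "guestbook.example": date(2012, 4, 29),
    "wiki.example": date(2012, 7, 5),
}
ratio = time_centrality_ratio(posted)
```

A ratio close to 1 means nearly all harbors carried the link within the same month.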

NeighborWatcher
Question: Can we exploit the infrastructure of harbors to help detect comment spam?
Detection of comment spam
- Intuition: if a link is posted at a similar time on a set of harbors that have a close relationship, it is highly likely to be spam.
- System: NeighborWatcher

NeighborWatcher: System Design
System Architecture

NeighborWatcher: Building Spamming Infrastructure

NeighborWatcher: Spamming Inference
- Figure labels: url.com, spamming structure, real posting structure
- Similarity measure: modified cosine similarity
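
The slide names a modified cosine similarity between the spamming structure and the real posting structure of a URL, but it does not spell out the modification. The sketch below therefore shows only the standard cosine similarity over two harbor-weight vectors. The dictionary representation, the unit weights, and the function name `cosine_similarity` are assumptions; the 0.5 cutoff is one of the thresholds listed in the evaluation slide.

```python
import math


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Standard cosine similarity of two sparse vectors keyed by harbor."""
    dot = sum(weight * v[harbor] for harbor, weight in u.items() if harbor in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical vectors: harbors expected to carry the link versus
# harbors on which the link was actually observed.
spamming_structure = {
    "blog.example": 1.0,
    "forum.example": 1.0,
    "guestbook.example": 1.0,
}
real_posting_structure = {"blog.example": 1.0, "forum.example": 1.0}

score = cosine_similarity(spamming_structure, real_posting_structure)
inferred_spam = score >= 0.5
```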

Evaluation: Stability of Spamming Structure
- Changed relationship (figure example): 2 - 1 = 1

Evaluation: Effectiveness of Inference
- Ground truth: 500 verified spam links; top 20,000 domains from Alexa => 754 benign links
- Hit Count: the number of correctly inferred spam
- Hit Rate: the ratio of correctly inferred spam to the total number of inferred spam

Sim. Threshold   Hit Rate   Hit Count   False Positive
0.3              57.29%     432         322
0.4              75.93%     426         135
0.5              97.14%     408         12
0.6              97.8%      360         8
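
The hit rates in the table can be recomputed from the other two columns. The sketch assumes that the total number of inferred spam equals hit count plus false positives, which the slide does not state explicitly. Under that assumption the recomputed values agree with the reported percentages to within 0.05 percentage points.

```python
# (similarity threshold, hit count, false positives), copied from the slide
rows = [
    (0.3, 432, 322),
    (0.4, 426, 135),
    (0.5, 408, 12),
    (0.6, 360, 8),
]


def hit_rate(hit_count: int, false_positives: int) -> float:
    """Share of inferred links that are correct, assuming
    total inferred = hit count + false positives."""
    return hit_count / (hit_count + false_positives)


# Hit rate in percent, rounded to two decimals, keyed by threshold.
rates = {t: round(100 * hit_rate(h, fp), 2) for t, h, fp in rows}
```

Raising the threshold from 0.4 to 0.5 removes most false positives (135 down to 12) while the hit count drops only from 426 to 408.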

Evaluation: Constancy
- Can our system continue to find new spam over time?

Applications: Early Warning
- Zero-day spam: spam that cannot be found through Google search at that time.

Applications: Blacklists
- Question: can our system complement existing blacklists?
- Existing blacklists: IPs, emails
- Figure panels: Daily IPs, Daily Emails

Summary
- We conduct an in-depth study of comment spam from a new perspective: the spamming infrastructure.
- We conclude that spammers prefer to keep using their spam harbors for spamming.
- We design a graph-based inference system to infer comment spam.

Q&A

Discussion
False negatives
- Only appeared on input harbor
- Crawl more harbors
- Need more time
False positives
- 5 are used for testing
- 7 are spammed on Alexa top 20,000 websites