Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Similar documents
Previous: how search engines work

Web Spam Challenge 2008

Page Rank Link Farm Detection

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

Learning to Detect Web Spam by Genetic Programming

An introduction to Web Mining part II

Exploring both Content and Link Quality for Anti-Spamming

Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web

Introduction to Data Mining

Link Analysis in Web Mining

A Survey of Major Techniques for Combating Link Spamming

CS47300 Web Information Search and Management

Web Spam Detection with Anti-Trust Rank

Towards Building A Personalized Online Web Spam Detector in Intelligent Web Browsers

Robust PageRank and Locally Computable Spam Detection Features

Anti-Trust Rank for Detection of Web Spam and Seed Set Expansion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

Detecting Content Spam on the Web through Text Diversity Analysis

An Overview of Search Engine Spam. Zoltán Gyöngyi

Identifying Spam Link Generators for Monitoring Emerging Web Spam

Detecting Spam Web Pages

Incorporating Trust into Web Search

A Reference Collection for Web Spam

Slides based on those in:

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER

Topical TrustRank: Using Topicality to Combat Web Spam

CS425: Algorithms for Web Scale Data

Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection

Big Data Analytics CSCI 4030

Splog Detection Using Self-Similarity Analysis on Blog Temporal Dynamics. Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Web Spam Taxonomy. Abstract. 1 Introduction. 2 Definition. Hector Garcia-Molina Computer Science Department Stanford University

Analysis of Large Graphs: TrustRank and WebSpam

New Classification Method Based on Decision Tree for Web Spam Detection

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Analyzing and Detecting Review Spam

Mining Web Data. Lijun Zhang

Part 1: Link Analysis & Page Rank

Web Spam Taxonomy. Zoltán Gyöngyi Hector Garcia-Molina

Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Analyzing the Popular Words to Evaluate Spam in Arabic Web Pages

Analysis of Web Pages through Link Structure

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Big Data Analytics CSCI 4030

Keywords Web Mining, Language Model, Web Spam Detection, SVM, C5.0,

World Wide Web has specific challenges and opportunities

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

Mining Web Data. Lijun Zhang

A novel approach of web search based on community wisdom

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Time-aware Authority Ranking

Classifying Spam using URLs

New Metrics for Reputation Management in P2P Networks

An Archetype for Web Mining with Enhanced Topical Crawler and Link-Spam Trapper

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Adversarial Web Search. Contents

Welcome to the class of Web and Information Retrieval! Min Zhang

NeighborWatcher: A Content-Agnostic Comment Spam Inference System

Adversarial Information Retrieval on the Web AIRWeb 2006

Unit VIII. Chapter 9. Link Analysis

Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al.

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Measuring Similarity to Detect Qualified Links

Identifying Web Spam With User Behavior Analysis

University of TREC 2009: Indexing half a billion web pages

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

International Journal of Advanced Research in Computer Science and Software Engineering

Ranking web pages using machine learning approaches

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Information Retrieval Issues on the World Wide Web

Fast Dynamic Reranking in Large Graphs

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Method to Study and Analyze Fraud Ranking In Mobile Apps

IDENTIFYING SEARCH ENGINE SPAM USING DNS. A Thesis SIDDHARTHA SANKARAN MATHIHARAN

Identifying Web Spam with the Wisdom of the Crowds

Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee

Data-Intensive Computing with MapReduce

REPORT DOCUMENTATION PAGE

Absorbing Random walks Coverage

Absorbing Random walks Coverage

Part I: Data Mining Foundations

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Clique-Attacks Detection in Web Search Engine for Spamdexing using K-Clique Percolation Technique

FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM

Survey on Web Spam Detection: Principles and Algorithms

Link Analysis and Web Search

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

A study of Video Response Spam Detection on YouTube

Link Analysis in Web Information Retrieval

Effective Spam Filtering using Random Forest Machine Learning Algorithm

Authoritative K-Means for Clustering of Web Search Results

Transcription:

Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 2

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 3

On The Web Includes Information + Free Movies + Hotel Reservation + Casino + Buy Cheap Software + Flight Bookings + Win Lottery & etc Image: www.milliondollarhomepage.com 4

What is Web Spamming? Any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance of some web page considering the page s true value [5] Also called Search Engine Spamming or Spamdexing [5]: Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web, 2005. 5

What is Web Spamming? Example Query: Kaiser Pharmacy (at the time paper was written 2005) Result: had techdictionary.com as a 3 rd hit 6

Why is Web Spamming Bad? Search Engines suffer because It damages the search engine s reputation there is a the cost involved to crawling, indexing, storing the spam pages. Users suffer because precision in the query results is lowered 7

How is it done? Web Spamming Techniques involve Boosting Techniques Content spam (e.g, Term repetition) Link spam (e.g Link farming) Hiding Techniques For example, Content hiding, Cloaking 8

What do the Search Engines Ultimately want? Want to calculate and return the exact page ranks based on relevance and importance Want to avoid the spam pages altogether before they use resources that might be used in storing or processing those web pages 9

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 10

The OBJECTIVE A Web Spam Detection System that is most accurate and reliable The paper proposed a Web Spam Detection System that 1. uses the topology of the web graph by exploiting dependencies among the web pages 2. that combines both the link and content based features 11

The Flow Feature extraction Link based Content based Combining link & content based features Classification Clustering Propagation Stacked graphical learning Smoothing 12

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 13

Data Set Data Set Collected in May 2006 Web Spam Publicly available WEBSPAM-UK2006 dataset [15] Pages from.uk domain 77.9 million pages were collected corresponding to 11,400 hosts A group of volunteers labeled them normal, borderline or spam 6,552 hosts evaluated [15] http://www.yr-bcn.es/webspam/datasets/ 14

Data Set Distribution of host labels, as judged by human volunteers Evaluation True positive rate, or Recall R False positive rate F-measure 15

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 16

Link Based Features Degree related measures PageRank TrustRank Truncated PageRank Estimation of d supporters 140 features per host 17

Link Based Features Page Rank Uses link structure to determine importance or popularity of a web page Intuition: A web page is important if several other important pages point to it PageRank of a page influences and is being influenced by importance of the other pages 18

Link Based Features Page Rank Calculation: Initial scores already defined for all the pages B C A D B C A D [6] http://en.wikipedia.org/wiki/pagerank 19

Link Based Features Page Rank Page Rank with Random Surfer Model d is damping factor, M(p i ) is the set of pages that link to pi, L(p j ) is the number of outbound links on page pj, N is the total number of pages. 20

Link Based Features Trust Rank A page having high pagerank is more likely to be spam if it had no relationship with a set of trusted pages How it works? Determine the seed set S (using high PageRank or high out-degree) Determine good or bad nodes of the set S (using oracle) E.g, S ={,, } where, good = bad = 21

Link Based Features Trust Rank Calculate & propagate the trust from good pages (adjusting trust attenuation) Image [4]: http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf 22

Link Based Features Histogram of ratio b/w TrustRank & PageRank 23

Link Based Features Truncated Page Rank A variant of PageRank, diminishes the influence of a page to the PageRank score of its close neighbors web spam web 24

Link Based Features Estimation of d-supporters x is d-supporter of node y if shortest path from x to y has length d N d (x) be the set of d-supporters of page x For each page x, cardinality of N d (x) is an increasing fuction with respect to d. Image [4] 25

Link Based Features Bottleneck Number The bottleneck measure for page x, defined as indicates the minimum rate of growth of neighbors of x up to a certain distance spam pages have smaller bottle neck numbers than non-spam pages Bottleneck: Non-Spam & Spam Image [4] 26

Link Based Features Histogram of b 4 (x) of Spam and Non-spam Pages 27

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 28

Content Based Features Number of words in the page, title & average word length Fraction of anchor text Fraction of visible text Compression rate Corpus precision & corpus recall Query precision and query recall Independent trigram likelihood Entropy of trigrams 96 features per host 29

Content Based Features Average word length 30

Content Based Features Compression Rate Compression rate = size of compressed text (visible text) size of uncompressed text Precision and Recall F= Frequent terms in the collection T= Terms in the page Q= Frequent terms in the query log Corpus Recall = F T / F Query Recall = Q T / Q 31

Content Based Features Corpus Precision = F T / T Query Precision = Q T / T 32

Content Based Features Entropy of trigrams (Compression) Calculated on distribution of trigrams Let be probability distribution on trigrams of a page be the set of all trigrams in a page Entropy of trigrams= 33

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Link and Content Features Using the Web Topology Conclusion 34

Using Link and Content Based Features Cost Sensitive Decision Tree Bagging 35

Using Content & Link Based Features Comparing Link and Content based features 36

The Agenda Focus of the First Paper Motivation The Objective DATASET Web Spam Link Based Features Content Based Features Using the Content & Link Based Features Using the Web Topology Conclusion 37

Using the Web Topology Observation : Similar pages tend to be linked to be linked together more frequently than dissimilar ones Similar pages tend to be linked to be linked together more frequently than dissimilar ones 38

Using the Web Topology 39

Using the Web Topology -Topological Dependencies of Spam Nodes Sout= No. of spam hosts linked by x All hosts linked by x Sout Sin= No. of spam hosts linked to x All hosts linked to x Sin 40

Using the Web Topology Clustering Using METIS graph clustering algorithm 41

Using the Web Topology Clustering 42

Using the Web Topology Propagation Use the graph topology to smooth predictions by propagating them as random walks Main Idea Use the predicted spamicity of a particular classification method and start a random walk with the restart probabilty 1-α Where h: host p(h) : [0..1] ( p(h)=0: non-spam, p(h)=1: spam, for each host h) v (0) : vector such that v (0) h = outdeg(g): the out-degree of g 43

Using the Web Topology- Propagation Propagation Applied with three forms of random walk on the host graph (forward) on the transposed host graph (backward) on the undirected host graph (both) Observed improvements on out of other tried values 44

Using the Web Topology Propagation Results with in the table 45

Using the Web Topology Stacked Graph Learning Uses initial predictions by the classification scheme C for all the objects in the data set Generates a set of extra features for each object based on relevance ( w.r.t in-links, outlinks or both) Let r(h) be the set if pages related to host h Then, Adds these extra features f(h) to the input of C and run the algorithm again 46

Using the Web Topology Stacked Graph Learning Improvement is seen in the first Iteration 47

Using the Web Topology Stacked Graph Learning Second Iteration showed even better results 48

The Agenda Focus of the First Paper Motivation The Objective DATASET Link Based Features Content Based Features Using the Web Topology Conclusion Web Spam 49

Conclusion Experiments have clearly shown that Combining both the link and content based features brings significant improvement to spam detection techniques Using web graph dependencies, we can detect the web spam by turning spammers ingenuity against themselves 50

Thank You! Q&A and Discussion 51

References-1 [1] www.milliondollarhomepage.com [2] http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf [3] http://images.google.com/imgres? [4] http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf [5] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web, 2005. [6] http://en.wikipedia.org/wiki/pagerank [7] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web Spam. In AIRWeb, 2006. [8] http://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdf 52

References-2 [9] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. Technical report, DELIS Dynamically Evolving, Large-Scale Information Systems, 2006. [10] http://en.wikipedia.org/wiki/cross-validation [11] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83 92, Edinburgh, Scotland, May 2006. [12] http://www.salford-systems.com/doc/bagging_predictors.pdf [13] Z. Gy ongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th VLDB Conf., 2004. [14] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. ACM SIGIR Forum, 36(2), 2002. [15] http://www.yr-bcn.es/webspam/datasets/ 53