Identifying Web Spam With User Behavior Analysis

Identifying Web Spam With User Behavior Analysis Yiqun Liu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru State Key Lab of Intelligent Tech. & Sys. Tsinghua University 2008/04/23

Introduction: simple math
How many spam pages are there on the Web? Over 10% of all pages (Fetterly et al. 2004, Gyöngyi et al. 2004), and the Web has 152 billion pages (How Much Info project 2003).
How many pages can a search engine index? Tens of billions (Google: 8 billion in 2004; Yahoo: 20 billion in 2005).
The number of spam pages is therefore comparable to, or larger than, a search engine's index size. Without spam detection, the search index would fill up with useless pages.
We have developed lots of spam detection methods. However...
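The comparison on this slide is simple arithmetic, and a small sketch makes it explicit. The figures are the ones quoted on the slide; the 10% spam share is treated as a lower bound, and the constant and function names are illustrative only.

```python
# Figures quoted on the slide; names are illustrative.
WEB_PAGES = 152e9        # estimated pages on the Web (How Much Info project 2003)
SPAM_FRACTION = 0.10     # "over 10%", used here as a lower bound
INDEX_SIZES = {"Google (2004)": 8e9, "Yahoo (2005)": 20e9}


def estimated_spam_pages(total_pages: float, spam_fraction: float) -> float:
    """Lower-bound estimate of the number of spam pages on the Web."""
    return total_pages * spam_fraction


spam_pages = estimated_spam_pages(WEB_PAGES, SPAM_FRACTION)
for engine, index_size in INDEX_SIZES.items():
    ratio = spam_pages / index_size
    print(f"{engine}: spam estimate is {ratio:.2f}x the index size")
```

With these inputs the spam estimate falls between the two quoted index sizes, which is the point the slide makes.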

Introduction
Searching for "N95 battery time" with a certain Chinese search engine on 2008/04/17:
Result #1: a cloaking spam page.
Result #3: the page cannot be reached (the cached copy shows a content spam page).
Result #4: a search result page from another engine, with ads (also content spam).

Introduction
Problem: spam detection is a never-ending process. Good news for anti-spam engineers! Bad news for Web users and search engines.
Are detection methods not effective? No! Many works report over 90% detection accuracy (Ntoulas et al. 2006, Saito et al. 2007, Lin et al. 2007, ...).
Are detection methods not timely? Yes! When a new kind of spam appears, it takes anti-spam engineers a long time to notice it.

Introduction
How does spam make a profit?
[Figure: UV / profit over time for a certain kind of Web spam technique; times T1 and T2 are marked on the time axis.]

Introduction
Important: find each new kind of spam as soon as possible.
Detecting a new kind of Web spam technique in a timely manner reduces the spam profit. When profit < cost, spam stops.

User-behavior Features
Users are the first to notice a new spam page. How can we use the wisdom of crowds to detect spam? Social annotation? (possibly noisy) Web access log analysis.
Web access logs: collected by a commercial search engine from July 1st, 2007 to August 26th, 2007; 2.74 billion user clicks on 800 million Web pages.

User-behavior Features
The behavior features we propose:
How many user visits come from search engines?
How many users follow links on the page?
How many users will not visit the site again in the future?
How many user visits come from hot keyword searches?
How many pages does a given user visit on the site?
How many users visit the site?

User-behavior Features: Search engine oriented visiting rate (SEOV rate)
Web spam pages are designed to get an unjustifiably favorable relevance or importance score from search engines (Gyöngyi et al. 2005).
Assumption: most user visits to Web spam come from search engine result lists.
Definition:
$$\mathrm{SEOV}(p) = \frac{\#(\text{search-engine-oriented visits of } p)}{\#(\text{visits of } p)}$$
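As a rough illustration of the SEOV definition, the sketch below counts visits per page from a simplified access log. The `Visit` record, the `SEARCH_ENGINE_HOSTS` tuple and the referrer-matching rule are assumptions made for the example; the slides do not describe the actual log format.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional


class Visit(NamedTuple):
    """Hypothetical log record: the visited page and the referring URL, if any."""
    page: str
    referrer: Optional[str]


# Hypothetical list of search engine hosts; a real system would use its own list.
SEARCH_ENGINE_HOSTS = ("search.example.com",)


def is_search_referrer(referrer: Optional[str]) -> bool:
    """True when the visit arrived from a known search engine host."""
    return referrer is not None and any(host in referrer for host in SEARCH_ENGINE_HOSTS)


def seov_rates(visits: Iterable[Visit]) -> Dict[str, float]:
    """SEOV(p) = #(search-engine-oriented visits of p) / #(visits of p)."""
    total: Counter = Counter()
    from_search: Counter = Counter()
    for visit in visits:
        total[visit.page] += 1
        if is_search_referrer(visit.referrer):
            from_search[visit.page] += 1
    return {page: from_search[page] / total[page] for page in total}
```

Pages that never appear in the log get no entry, which avoids dividing by zero.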

User-behavior Features: SEOV rate distribution
[Figure: histogram of SEOV rate for ordinary pages and spam pages; x-axis: SEOV in bins of width 0.1 from 0 to 1; y-axis: percentage of pages.]
Some spam pages do not receive many UVs from search engines, either. Most user visits to ordinary pages do not come from search engines.

User-behavior Features: Source page rate (SP rate)
Spam pages are usually designed to show users ads or low-quality information at first glance. Users don't trust hyperlinks on spam pages.
Assumption: most Web users will not follow hyperlinks on spam pages.
Definition:
$$\mathrm{SP}(p) = \frac{\#(p \text{ appears as the source page})}{\#(p \text{ appears in the Web access logs})}$$
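A minimal sketch of the SP rate, assuming each log record is a hypothetical `(source, destination)` click pair and that "appears in the Web access logs" means appearing as either end of a click. Both are assumptions for illustration, not details given on the slide.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def sp_rates(clicks: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """SP(p) = #(p appears as the source page) / #(p appears in the logs).

    Each click is an assumed (source_page, destination_page) pair; a page
    "appears" once for every click in which it is the source or the destination.
    """
    as_source: Counter = Counter()
    appearances: Counter = Counter()
    for source, destination in clicks:
        as_source[source] += 1
        appearances[source] += 1
        appearances[destination] += 1
    return {page: as_source[page] / appearances[page] for page in appearances}
```

A page whose links are never followed gets an SP rate of 0, which matches the assumption that users rarely click through from spam pages.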

User-behavior Features: SP rate distribution
[Figure: histogram of SP rate for ordinary pages and spam pages; x-axis: SP in bins of width 0.05 from <0.05 to >0.45; y-axis: percentage of pages.]
Users click hyperlinks on some spam pages, too (they may be cheated by anchor texts). Half of the spam pages have very small SP values.

User-behavior Features: Short-time Navigation rate (SN rate)
Users cannot be cheated again and again within a short time period.
Assumption: most Web users will not visit a spam site many times in the same user session.
Definition (N is a parameter):
$$\mathrm{SN}(s) = \frac{\#(\text{sessions in which users visit fewer than } N \text{ pages in } s)}{\#(\text{sessions in which users visit } s)}$$
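The SN rate can be sketched as follows, assuming sessions have already been extracted from the logs as lists of hypothetical `(site, page)` visits and that "pages" means page views rather than distinct pages. The slides do not say how sessions are segmented, so that step is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def sn_rates(sessions: Iterable[Sequence[Tuple[str, str]]], n: int = 3) -> Dict[str, float]:
    """SN(s) = #(sessions with fewer than n page views in s) / #(sessions visiting s).

    Each session is an assumed sequence of (site, page) visits by one user.
    """
    sessions_visiting: Counter = Counter()
    short_sessions: Counter = Counter()
    for session in sessions:
        views_per_site = Counter(site for site, _page in session)
        for site, views in views_per_site.items():
            sessions_visiting[site] += 1
            if views < n:
                short_sessions[site] += 1
    return {site: short_sessions[site] / sessions_visiting[site] for site in sessions_visiting}
```

The default `n=3` mirrors the value used for the SN distribution reported in the talk; a high SN rate means most sessions leave the site after only a few page views.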

User-behavior Features: SN rate distribution (N = 3)
[Figure: histogram of SN rate for ordinary pages and spam pages; x-axis: SN in bins of width 0.1 from 0 to 1; y-axis: percentage of pages.]
A number of ordinary pages also receive few UVs in a session (redirection sites, low-quality sites, ...). Few spam pages are visited more than 2 times in a session.

User-behavior Features: correlation values between these features
The features rest on different assumptions and different information sources, so their correlation is relatively low. This makes it possible to use Bayes learning methods.

        SEOV     SP       SN
SEOV    1.0000   0.1981   0.1780
SP      0.1981   1.0000   0.0460
SN      0.1780   0.0460   1.0000
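The slide does not say which correlation measure was used. Assuming the Pearson coefficient, a pairwise value between two feature columns could be computed with a sketch like this; the function name and the input layout (one list of values per feature, aligned by page) are illustrative.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples.

    Assumption: the slide's "correlation values" are Pearson coefficients.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)
```

Calling `pearson` on each pair of feature columns would fill a symmetric matrix like the one on the slide, with 1.0 on the diagonal.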

Detection algorithm
Problem: uniform sampling of negative examples (pages which are not spam) is difficult.
Solution: learn from positive examples (Web spam) and unlabelled data (the Web corpus). Calculate the probability of a page p being Web spam using user behavior features.

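Detection algorithm
For a single feature A:
$$P(p \in \mathrm{Spam} \mid p \text{ has feature } A) \propto \frac{\#(p \text{ has feature } A \cap p \in \text{Spam sample set}) \,/\, \#(\text{Spam sample set})}{\#(p \text{ has feature } A) \,/\, \#(\mathrm{CORPUS})}$$
For the three features SEOV, SP and SN: the features are approximately independent, as well as conditionally independent given the target value, so
$$P(p \in \mathrm{Spam} \mid p \text{ has features } A_1, A_2, \ldots, A_n) \propto \prod_{i=1}^{n} \frac{\#(p \text{ has feature } A_i \cap p \in \text{Spam sample set}) \,/\, \#(\text{Spam sample set})}{\#(p \text{ has feature } A_i) \,/\, \#(\mathrm{CORPUS})}$$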
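One way to read the formulas above is as a product of per-feature ratios: the share of the spam sample set that has a feature, divided by the share of the whole corpus that has it. The sketch below follows that reading. It assumes that "p has feature A" means "p's feature value falls in the same bin as A" after discretizing each rate into equal-width bins; the binning, the `Page` layout and all function names are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence

Page = Dict[str, float]  # hypothetical layout: feature name -> rate in [0, 1]


def bin_of(value: float, bins: int = 10) -> int:
    """Index of the equal-width bin that a rate in [0, 1] falls into."""
    return min(int(value * bins), bins - 1)


def bin_fractions(pages: Sequence[Page], feature: str, bins: int) -> Dict[int, float]:
    """Fraction of the given pages that fall into each bin of one feature."""
    counts = Counter(bin_of(page[feature], bins) for page in pages)
    return {b: count / len(pages) for b, count in counts.items()}


def spam_scores(
    corpus: Sequence[Page],
    spam_sample: Sequence[Page],
    features: Sequence[str] = ("SEOV", "SP", "SN"),
    bins: int = 10,
) -> List[float]:
    """Score each corpus page by the product of per-feature ratios.

    The score is proportional to the estimated spam probability, not a
    probability itself, so it is only meaningful for ranking pages.
    """
    tables = {
        f: (bin_fractions(spam_sample, f, bins), bin_fractions(corpus, f, bins))
        for f in features
    }
    scores = []
    for page in corpus:
        score = 1.0
        for f in features:
            spam_fraction, corpus_fraction = tables[f]
            b = bin_of(page[f], bins)
            # corpus_fraction[b] is non-zero because this page itself is in the corpus.
            score *= spam_fraction.get(b, 0.0) / corpus_fraction[b]
        scores.append(score)
    return scores
```

Sorting the corpus by descending score would give a list of spam candidates; no negative examples are needed, only the spam sample and the unlabelled corpus.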

Detection algorithm Algorithm Description

Experimental Results: experiment setup
Training set: 802 spam sites, collected from the result lists of the hottest search queries.
Test set: 1564 Web sites, each annotated as spam or not: 345 spam, 1060 non-spam, 159 cannot tell.
The percentage of spam is higher than the estimates given by Fetterly et al. and Gyöngyi et al. (we only retain sites that are visited at least 10 times).

Experimental Results: how to evaluate the performance
Focus: finding recently appeared spam types (not detecting all possible spam types).
1: Are the spam candidates identified by this algorithm really Web spam? (effectiveness)
2: Does this algorithm detect spam types in a more timely way than current search engines? (timeliness)
3: Which feature is more effective?

Experimental Results: detection performance (effectiveness)
Are the top-ranked candidates Web spam? Among the 300 pages with the highest P(Spam) values, only 6% are not Web spam (low-quality pages, SEO pages). Many spam types can be identified (wisdom of crowds).

Page type                               Percentage
Non-spam pages                          6.00%
Web spam pages (content spamming)       21.67%
Web spam pages (link spamming)          23.33%
Web spam pages (other spamming)         10.67%
Pages that cannot be accessed           38.33%

Experimental Results: detection performance (timeliness)
Experiments with one of the most frequently used Chinese search engines (we use X to represent it).
Recent data: access logs from 2008/02/04 to 2008/03/02.
Of the 1000 top-ranked spam candidate sites, 723 are spam sites (some could not be reached).
X had indexed 34 million pages from these 723 sites in early March; by the end of March it had indexed 59 million.
These spam sites were not detected by X, so X spent lots of resources on these useless pages.

Experimental Results Detection performance (algorithm & features) AUC value of the detection algorithm is about 80%
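For readers unfamiliar with the metric, AUC can be computed directly from ranked scores as the probability that a randomly chosen spam page scores higher than a randomly chosen non-spam page. The sketch below uses that pairwise definition; it illustrates the metric only and is not the evaluation code behind the reported value.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for spam, 0 for non-spam. Ties between a spam score and a
    non-spam score count as half a win.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one spam and one non-spam example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of examples, which is fine for a test set of a few thousand sites; a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.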

Experimental Results: detection performance (algorithm & features)
The learning algorithm performs better than any single feature.

Experimental Results Detection performance (algorithm & features) SN performs the worst: Examples: Q&A portal, Audio/Video sharing sites.

Conclusions
The amount of Web spam may exceed search engine index sizes.
Timeliness is as important as effectiveness in spam detection.
User behavior features can be used to find recently appeared spam types in a timely and effective way.