CLOAK OF VISIBILITY: DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB


CIS 601: Graduate Seminar
Prof. S. S. Chung
Presented by: Amol Chaudhari (CSU ID 2682329)

AGENDA
- About
- Introduction
- Contributions
- Background
- How Common Is Cloaking
- Perspective of Cloaking
- Cloaking Techniques
- Detecting Cloaking
- Performance and Evaluation

ABOUT
- Paper title: Cloak of Visibility: Detecting When Machines Browse a Different Web
- Publication: 2016 IEEE Symposium on Security and Privacy
- Authors: Luca Invernizzi, Kurt Thomas, and other Google researchers

WHAT IS CLOAKING?
- A search engine optimization (SEO) technique.
- The content presented to the search engine spider is different from that presented to the user's browser.
- Done by delivering content based on the IP address or the User-Agent HTTP header of the client requesting the page.

CONTRIBUTIONS
- Provides the first broad study of blackhat cloaking techniques and the companies affected.
- Builds a distributed crawler and classifier that detects and bypasses mobile, search, and ads cloaking, with 95.5% accuracy and a false positive rate of 0.9%.
- Measures the most prominent search and ads cloaking techniques in the wild, finding that 4.9% of ads and 11.7% of search results cloak against Google's generic crawler.
- Determines the minimum set of capabilities required of security crawlers to contend with cloaking today.

BACKGROUND
Web cloaking incentives (for bad actors):
I) Search results
- Servers manipulate fake or compromised pages so that they appear attractive to crawlers (bots), while organic visitors are guided to (illegal) profit-generating content pages.
II) Advertisements
- Miscreants pay advertising networks to display their URLs. They rely on cloaking to evade ad policies that strictly prohibit dietary scams, trademark-infringing goods, or any form of deceptive advertisement, including malware.
III) Drive-by downloads
- Miscreants compromise popular websites and load the pages with drive-by exploits.

HOW COMMON IS CLOAKING
- Wang et al. estimated that 2.2% of Google searches for trending keywords contained cloaked results.
- 61% related to certain merchandise.
- 31% of searches for pharmaceutical keywords advertised in spam emails led to cloaked content.

PERSPECTIVE OF CLOAKING
- Cloaking software packages range in price from $167 to $13,188.
- The authors analyzed the cloaking packages in order to understand:
  1. Fingerprinting capabilities
  2. Switch logic for displaying targeted content
  3. Other built-in SEO capabilities
     i. Content spinning
     ii. Keyword stuffing
- Software analysis: the packages use various languages, such as C++, Perl, JavaScript, and PHP.

CLOAKING TECHNIQUES
- Network Fingerprinting
- Browser Fingerprinting
- Contextual Fingerprinting

NETWORK FINGERPRINTING
IP Address
- The IP address blacklists contained 54,166 unique IPs tied to popular search engines and crawlers at the time of analysis.
Reverse DNS
- When a bot arrives from a non-blacklisted IP, 4 of the 10 cloaking services perform a reverse DNS (rDNS) lookup of the visitor's IP.
- The software compares the rDNS record against a list of domain strings belonging to search engines (Google, Yahoo, etc.).
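The rDNS comparison above amounts to a suffix match on the hostname returned for the visitor's IP. The sketch below is only an illustration of that idea: the suffix list and the function names are hypothetical and are not taken from any cloaking package.

```python
import socket

# Hypothetical suffix list for illustration only.
CRAWLER_DOMAIN_SUFFIXES = (".googlebot.com", ".google.com", ".crawl.yahoo.net")


def lookup_rdns(ip):
    """Return the reverse-DNS hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


def rdns_matches_crawler(hostname, suffixes=CRAWLER_DOMAIN_SUFFIXES):
    """True if the hostname ends with one of the known crawler domain suffixes."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(suffix) for suffix in suffixes)
```

A security crawler that wants to avoid this check has to browse from addresses whose rDNS record does not resolve to a search engine domain.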

BROWSER FINGERPRINTING
- Well-behaved search and advertisement crawlers announce their presence with a special User-Agent string that is advertised on the operator's website, e.g. Google's googlebot and Microsoft's bingbot.
- Cloaking services ubiquitously rely on User-Agent comparison to block Google, Bing, Yahoo, etc.
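A User-Agent comparison of this kind is typically a case-insensitive substring test. The following minimal sketch assumes a token list containing only the two crawler names mentioned on the slide; the function name is hypothetical.

```python
# Tokens named on the slide; real block lists are longer (assumption).
CRAWLER_UA_TOKENS = ("googlebot", "bingbot")


def announces_crawler(user_agent, tokens=CRAWLER_UA_TOKENS):
    """True if the User-Agent header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)
```

Because the check is this simple, a detection crawler can sidestep it by sending an ordinary browser User-Agent, which is one reason the system crawls with several browser profiles.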

CONTEXTUAL FINGERPRINTING
- This technique prevents crawlers from harvesting URLs and visiting them outside the context in which they first appeared.

REDIRECTION TECHNIQUES
- Stopped at a doorway
- Redirected to an entirely different domain

DETECTING CLOAKING
1. Crawl URLs from the Alexa Top Million, Google Search, and Google Ads to scan for cloaking.
2. Fetch each of those URLs via multiple browsers and networks to trigger any cloaking logic.
3. Compare the similarity of the content returned by each crawl, feeding the resulting metrics into a classifier that detects divergent content indicative of cloaking.

URL SELECTION
- Collect the URLs.
- Split the dataset into 2 parts: one for training a classifier on labeled data (Table 1), and a second to feed into the classifier to analyze cloaking (Table 2).

BROWSER CONFIGURATION
- Crawl each URL with 11 different browser and network configurations in an attempt to trigger any cloaking logic.
- Repeat each crawl 3 times to remove noise introduced by network errors.
- 3 platforms used: Chrome on desktop, Chrome on Android, and a basic HTTP fetch.

CONTEXT & NETWORK SELECTION
Keywords
- Non-ads URLs: filter the page's HTML to keep only the visible part and select the top 3 most frequent words.
- Ads URLs: rely on the keywords the advertiser bids on for targeting (gathered from Google AdWords).
Networks
- Google's network, AT&T or Verizon, a Google Cloud datacenter, and a residential IP.
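The keyword step for non-ads URLs can be sketched with the Python standard library. This is a rough approximation under stated assumptions: "visible" is taken to mean text outside script, style, and noscript tags, no stop words are removed, and the class and function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside script, style, or noscript tags."""

    HIDDEN = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.chunks.append(data)


def top_keywords(html, k=3):
    """Return the k most frequent words in the visible text of an HTML page."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [word for word, _count in Counter(words).most_common(k)]
```

Text hidden through CSS would still be counted here, so a real crawler would need a rendered view of the page rather than a tag filter.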

SIMILARITY FEATURES
Content Similarity
- Detects entirely distinct content by estimating the similarity of the documents' data.
- Clean the data, tokenize the content using a sliding window, calculate a 64-bit simhash of all tokens, and compute the Hamming distance between the two simhashes.
- A large Hamming distance between two simhashes indicates that the two documents are distinct.
Screenshot Similarity
- Captures visual differences in the layout and media presented to browsing profiles with the same window dimensions.
Element Similarity
- Extract the set of embedded images per document and calculate the difference in media content (Jaccard similarity coefficient).
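The content and element features above can be illustrated in a few lines. In this sketch the window width of 3 words and the use of SHA-256 as the per-token hash are assumptions made for the example, not details given on the slides, and all function names are hypothetical.

```python
import hashlib


def shingles(text, width=3):
    """Tokenize text into overlapping word windows (sliding window)."""
    words = text.lower().split()
    if len(words) < width:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + width]) for i in range(len(words) - width + 1)]


def simhash64(tokens):
    """64-bit simhash: each token votes +1 or -1 on every bit position."""
    votes = [0] * 64
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # keep 64 bits of the hash
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two simhashes differ."""
    return bin(a ^ b).count("1")


def jaccard(a, b):
    """Jaccard similarity coefficient of two sets, e.g. embedded image URLs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two crawls of the same URL would then be compared with `hamming(simhash64(shingles(text_a)), simhash64(shingles(text_b)))` for the text and `jaccard(images_a, images_b)` for the embedded images.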

CLASSIFICATION
Used Extremely Randomized Trees:
- Ensemble, non-linear, supervised learning model
- Candidate features and thresholds are selected entirely at random
- All features are normalized into a range (0,1)
- Evaluation relies on ten-fold cross-validation
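The two preprocessing and evaluation steps named above, normalization and ten-fold cross-validation, can be sketched without any library. This example does not implement the Extremely Randomized Trees model itself; the function names and the round-robin fold assignment are assumptions made for illustration.

```python
def min_max_normalize(column):
    """Rescale one feature column so its values fall between 0 and 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature carries no signal
    return [(value - lo) / (hi - lo) for value in column]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out
```

In each of the ten rounds the classifier would be trained on the `train` indices and scored on the held-out fold, and the reported metrics would be averaged across rounds.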

SYSTEM SPECIFICATION
- Runs on Google Compute Engine, with crawling and featurization distributed among 20 Ubuntu machines.
- The scheduler is built on top of Celery backed by Redis.
- Featurization and classification rely on scikit-learn and Pandas.
- All network requests are captured and logged via mitmproxy.
- The networks used are AT&T and Verizon on prepaid plans.

PERFORMANCE AND EVALUATION
- The overall accuracy of the system is 95.5%.
- The system correctly labels 99.1% of Alexa URLs as non-cloaked, which corresponds to a false positive rate of 0.9%.
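The metrics above follow from the standard confusion-matrix definitions, which also show why a 99.1% correct rate on non-cloaked URLs and a 0.9% false positive rate are the same statement. The counts in the usage line below are made up for illustration and are not the paper's data; the function name is hypothetical.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, false positive rate, and true negative rate from raw counts."""
    total = tp + tn + fp + fn
    negatives = tn + fp  # URLs that are truly non-cloaked
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / negatives,
        "true_negative_rate": tn / negatives,  # equals 1 - false positive rate
    }


# Hypothetical counts: 1000 non-cloaked URLs, 9 of them flagged by mistake.
metrics = evaluation_metrics(tp=90, tn=991, fp=9, fn=10)
```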

[Slide results: classifier performance when training on a single feature class, and when training on all but one feature class]

CONCLUSION
- In this work, we explored the cloaking arms race playing out between security crawlers and miscreants.
- We compared and classified the content returned for 94,946 labeled URLs, arriving at a system that accurately detected cloaking 95.5% of the time with a false positive rate of 0.9%.
- In the process, we exposed a gap between current blackhat practices and the broader set of fingerprinting techniques known within the research community, which may yet be deployed.

THANK YOU