CLOAK OF VISIBILITY: DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB
CIS 601: Graduate Seminar
Prof. S. S. Chung
Presented by: Amol Chaudhari (CSU ID 2682329)
AGENDA
- About
- Introduction
- Contributions
- Background
- How Common is Cloaking
- Perspective of Cloaking
- Cloaking Techniques
- Detecting Cloaking
- Performance and Evaluation
ABOUT
Paper Title: Cloak of Visibility: Detecting When Machines Browse a Different Web
Publication: 2016 IEEE Symposium on Security and Privacy
Authors: Luca Invernizzi, Kurt Thomas, and other Google researchers
WHAT IS CLOAKING? Hypothetically Technically
WHAT IS CLOAKING?
- A search engine optimization (SEO) technique.
- The content presented to the search engine spider differs from that presented to the user's browser.
- Done by delivering content based on the IP address or the User-Agent HTTP header of the visitor requesting the page.
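The User-Agent variant can be sketched in a few lines. This is a hypothetical illustration of the mechanism, not code from the paper; the crawler token list and page strings are placeholders.

```python
# Hypothetical sketch of user-agent cloaking: the server returns different
# content to a known crawler than to an organic visitor.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")  # illustrative list

def is_crawler(user_agent: str) -> bool:
    """True if the User-Agent header matches a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)

def serve(user_agent: str) -> str:
    """Pick which of the two pages to deliver, based on the User-Agent."""
    if is_crawler(user_agent):
        return "benign, keyword-rich page shown to the search engine spider"
    return "profit-generating content shown to the organic visitor"
```

The same switch logic applies when the trigger is the visitor's IP address instead of the header.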
CONTRIBUTIONS
- Provides the first broad study of blackhat cloaking techniques and the companies affected
- Builds a distributed crawler and classifier that detects and bypasses mobile, search, and ads cloaking, with 95.5% accuracy and a false positive rate of 0.9%
- Measures the most prominent search and ads cloaking techniques in the wild; finds that 4.9% of ads and 11.7% of search results cloak against Google's generic crawler
- Determines the minimum set of capabilities required of security crawlers to contend with cloaking today
BACKGROUND
Web Cloaking Incentives (with bad actors)
I) Search results: Servers manipulate fake or compromised pages to appear attractive to crawlers (bots), while organic visitors are guided to (illegal) profit-generating content pages
II) Advertisements: Miscreants pay advertising networks to display their URLs, relying on cloaking to evade ad policies that strictly prohibit dietary scams, trademark-infringing goods, or any form of deceptive advertisement, including malware
III) Drive-by downloads: Miscreants compromise popular websites and load pages with drive-by exploits
HOW COMMON IS CLOAKING
- Wang et al. estimated that 2.2% of Google searches for trending keywords contained cloaked results; 61% of these related to certain merchandise.
- 31% of pharmaceutical keywords advertised in spam emails led to cloaked content.
PERSPECTIVE OF CLOAKING
- Cloaking software packages range in price from $167 to $13,188.
- The authors analyzed the cloaking packages to understand:
  1. Fingerprinting capabilities
  2. Switch logic for displaying targeted content
  3. Other built-in SEO capabilities
     i. Content spinning
     ii. Keyword stuffing
- Software analysis: the packages are written in various languages, e.g., C++, Perl, JavaScript, PHP
CLOAKING TECHNIQUES Network Fingerprinting Browser Fingerprinting Contextual Fingerprinting
NETWORK FINGERPRINTING
IP Address: The blacklist of IP addresses contained 54,166 unique IPs tied to popular search engines and crawlers at the time of analysis.
Reverse DNS: When a bot appears from a non-blacklisted IP, 4 of the 10 cloaking services perform a reverse DNS (rDNS) lookup of the visitor's IP. The software compares the rDNS record against a list of domain strings belonging to crawler operators (Google, Yahoo, etc.)
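The rDNS step described above can be sketched as follows. The suffix list is illustrative, not the one actually shipped in cloaking kits, and `visitor_is_crawler` is a hypothetical helper name.

```python
import socket

# Sketch of reverse-DNS fingerprinting: resolve the visitor's IP to a
# hostname, then compare it against crawler operator domain suffixes.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com",
                    ".search.msn.com", ".crawl.yahoo.net")  # illustrative

def hostname_is_crawler(hostname: str) -> bool:
    """True if the rDNS record ends with a known crawler domain."""
    return hostname.lower().endswith(CRAWLER_SUFFIXES)

def visitor_is_crawler(ip: str) -> bool:
    """Perform the rDNS lookup and match against the suffix list."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # rDNS lookup
    except OSError:
        return False  # no PTR record: treat as an ordinary visitor
    return hostname_is_crawler(hostname)
```

Combining the IP blacklist with this rDNS check lets the cloaking software catch crawlers arriving from IPs it has not yet seen.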
BROWSER FINGERPRINTING
- Well-behaved search and advertisement crawlers announce their presence with a special User-Agent string documented on the operator's website, e.g., Google's googlebot and Microsoft's bingbot.
- All of the examined cloaking services rely on user-agent comparison and block Google, Bing, Yahoo, etc.
CONTEXTUAL FINGERPRINTING
This technique prevents crawlers from harvesting URLs and visiting them outside the context in which they first appeared.
REDIRECTION TECHNIQUES
- Stopped at a doorway
- Redirected to an entirely different domain
DETECTING CLOAKING
DETECTING CLOAKING
- Crawl URLs from the Alexa Top Million, Google Search, and Google Ads to scan for cloaking
- Fetch each URL via multiple browsers and networks to trigger any cloaking logic
- Compare the similarity of the content returned by each crawl
- Feed the resulting metrics into a classifier that detects divergent content indicative of cloaking
URL SELECTION
Collect the URLs and split the dataset into two parts: one for training a classifier on labeled data (Table 1), and one to feed into the classifier to analyze cloaking in the wild (Table 2).
BROWSER CONFIGURATION
- Crawl each URL with 11 different browser and network configurations in an attempt to trigger any cloaking logic.
- Repeat each crawl 3 times to reduce noise introduced by network errors.
- 3 platforms used: Chrome on Desktop, Chrome on Android, and basic HTTP fetch
CONTEXT & NETWORK SELECTION
Keywords
- Non-ad URLs: filter the page's HTML down to its visible text and select the top 3 most frequent words
- Ad URLs: rely on the keywords the advertiser bids on for targeting (gathered via Google AdWords)
Networks: Google's network, AT&T or Verizon mobile, a Google Cloud datacenter, and residential IPs
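The keyword-selection step for non-ad URLs can be sketched with the standard library. This is a simplified illustration, assuming "visible text" means everything outside `script`/`style` tags; the real system's HTML filtering is more involved.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text outside of script/style/noscript tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []    # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def top_keywords(html: str, k: int = 3):
    """Return the k most frequent words in the page's visible text."""
    parser = VisibleText()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [word for word, _ in Counter(words).most_common(k)]
```

These keywords then serve as the search context when re-fetching the URL, mimicking how an organic visitor would arrive at the page.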
SIMILARITY FEATURES
Content Similarity
- Detect entirely distinct content by estimating the similarity of document data: clean the data, tokenize the content with a sliding window, compute a 64-bit simhash over all tokens, and measure the Hamming distance between the two simhashes. A high Hamming distance between two simhashes indicates the documents are distinct.
Screenshot Similarity
- Visual differences in the layout and media presented to browsing profiles with the same window dimensions
Element Similarity
- Extract the set of embedded images per document and calculate the difference in media content (Jaccard similarity coefficient)
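The content and element features above can be sketched as follows. This is a minimal illustration, assuming MD5-based shingle hashing and a window size of 3 (both are placeholder choices, not the paper's exact parameters).

```python
import hashlib
from itertools import islice

def simhash64(tokens, window=3):
    """64-bit simhash over sliding-window shingles of a token list."""
    bits = [0] * 64
    shingles = zip(*(islice(tokens, i, None) for i in range(window)))
    for shingle in shingles:
        h = int.from_bytes(
            hashlib.md5(" ".join(shingle).encode()).digest()[:8], "big")
        for i in range(64):  # vote each bit up or down
            bits[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, b in enumerate(bits) if b > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhashes."""
    return bin(a ^ b).count("1")

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient over sets of embedded media."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Identical documents give a Hamming distance of 0; near-duplicates stay small; unrelated pages land near 32 on average, which is what flags divergent content.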
CLASSIFICATION
Used Extremely Randomized Trees
- Ensemble, non-linear, supervised learning model
- Candidate features and thresholds are selected entirely at random
- All features normalized into the range [0, 1]
- Evaluated with ten-fold cross-validation
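In scikit-learn (the library the system actually relies on), this setup can be sketched as below. The synthetic features and labels are stand-ins for the real content/screenshot/element similarity scores; only the model and evaluation procedure mirror the slide.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: 3 similarity features already normalized to [0, 1],
# with a synthetic "cloaking" label for illustration only.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Extremely Randomized Trees: an ensemble in which candidate split
# features and thresholds are drawn at random, scored with ten-fold CV.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

The real system reports its 95.5% accuracy from the same kind of ten-fold cross-validation, but over the 94,946 labeled URLs rather than synthetic data.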
SYSTEM SPECIFICATION
- Google Compute Engine, with crawling and featurization distributed among 20 Ubuntu machines
- Scheduler built on top of Celery, backed by Redis
- Featurization and classification rely on scikit-learn and pandas
- All network requests captured and logged via mitmproxy
- Mobile networks: AT&T and Verizon prepaid plans
PERFORMANCE AND EVALUATION
- The overall accuracy of the system is 95.5%, with a false positive rate of 0.9%.
- 99.1% of Alexa URLs are correctly detected as non-cloaked.
Evaluation charts: training on a single feature class vs. training on all but one feature class
CONCLUSION
- This work explored the cloaking arms race playing out between security crawlers and miscreants.
- It compared and classified the content returned for 94,946 labeled URLs, arriving at a system that accurately detected cloaking 95.5% of the time with a false positive rate of 0.9%.
- In the process, it exposed a gap between current blackhat practices and the broader set of fingerprinting techniques known within the research community, which may yet be deployed.
THANK YOU