Protecting Web Servers from Web Robot Traffic

Similar documents
CS229 Final Project: Predicting Expected Response Times

Linear Regression Optimization

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

Random Search Report An objective look at random search performance for 4 problem sets

1 Case study of SVM (Rob)

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1)

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

A Soft Computing Prefetcher to Mitigate Cache Degradation by Web Robots

LRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017

Using Synology SSD Technology to Enhance System Performance Synology Inc.

CAP 6412 Advanced Computer Vision

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Cache Controller with Enhanced Features using Verilog HDL

Finding a needle in Haystack: Facebook's photo storage

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

arxiv: v3 [cs.ni] 3 May 2017

CS145: INTRODUCTION TO DATA MINING

Artificial Intelligence. Programming Styles

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE

TABLE OF CONTENTS CHAPTER NO. TITLE PAGE NO. ABSTRACT 5 LIST OF TABLES LIST OF FIGURES LIST OF SYMBOLS AND ABBREVIATIONS xxi

Chapter 3: Supervised Learning

CIT 668: System Architecture. Caching

Virtual Memory. Chapter 8

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 8, NO. 2, APRIL Segment-Based Streaming Media Proxy: Modeling and Optimization

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

ChunkStash: Speeding Up Storage Deduplication using Flash Memory

Weka ( )

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Predicting User Ratings Using Status Models on Amazon.com

Facial Expression Classification with Random Filters Feature Extraction

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Machine Learning Techniques for Data Mining

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

ECE 669 Parallel Computer Architecture

Search Engines. Information Retrieval in Practice

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Beauty and the Burst

Threshold-Based Markov Prefetchers

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

ENVI: Elastic resource flexing for Network functions VIrtualization

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Energy Based Models, Restricted Boltzmann Machines and Deep Networks. Jesse Eickholt

CS249: ADVANCED DATA MINING

Achieving Digital Transformation: FOUR MUST-HAVES FOR A MODERN VIRTUALIZATION PLATFORM WHITE PAPER

Machine Learning using MapReduce

Notes on Multilayer, Feedforward Neural Networks

HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Week 7: Traffic Models and QoS

CS 61C: Great Ideas in Computer Architecture Performance and Floating-Point Arithmetic

Computationally Efficient M-Estimation of Log-Linear Structure Models

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Tools for Social Networking Infrastructures

A Comparison of Capacity Management Schemes for Shared CMP Caches

Lecture 22 : Distributed Systems for ML

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Popularity Prediction of Facebook Videos for Higher Quality Streaming

Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks

VIRTUAL MEMORY READING: CHAPTER 9

Thorsten Joachims Then: Universität Dortmund, Germany Now: Cornell University, USA

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

A Modular Approach to Document Indexing and Semantic Search

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Massive Scalability With InterSystems IRIS Data Platform

I211: Information infrastructure II

Data Driven Networks. Sachin Katti

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

SSD Admission Control for Content Delivery Networks

Measuring Intrusion Detection Capability: An Information- Theoretic Approach

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Paging algorithms. CS 241 February 10, Copyright : University of Illinois CS 241 Staff 1

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

SharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Enterprise Intranet Collaboration Environment

Multimedia Streaming. Mike Zink

Lecture 7: Decision Trees

Fast or furious? - User analysis of SF Express Inc

modern database systems lecture 10 : large-scale graph processing

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Basic Data Mining Technique

Mining Web Data. Lijun Zhang

Data Mining Part 5. Prediction

Tutorials Case studies

Part I: Data Mining Foundations

Theoretical Concepts of Machine Learning

Virtual Memory Design and Implementation

CSE 153 Design of Operating Systems

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

CS 153 Design of Operating Systems Winter 2016

Transcription:

Wright State University CORE Scholar Kno.e.sis Publications The Ohio Center of Excellence in Knowledge- Enabled Computing (Kno.e.sis) 11-14-2014 Protecting Web Servers from Web Robot Traffic Derek Doran Wright State University - Main Campus, derek.doran@wright.edu Follow this and additional works at: http://corescholar.libraries.wright.edu/knoesis Part of the Bioinformatics Commons, Communication Technology and New Media Commons, Databases and Information Systems Commons, OS and Networks Commons, and the Science and Technology Studies Commons Repository Citation Doran, D. (2014). Protecting Web Servers from Web Robot Traffic.. http://corescholar.libraries.wright.edu/knoesis/1046 This Presentation is brought to you for free and open access by the The (Kno.e.sis) at CORE Scholar. It has been accepted for inclusion in Kno.e.sis Publications by an authorized administrator of CORE Scholar. For more information, please contact corescholar@www.libraries.wright.edu.

Protecting Web Servers From Web Robot Traffic Derek Doran derek.doran@wright.edu Department of Computer Science & Engineering Kno.e.sis Research Center Wright State University, Dayton, OH, USA

Outline Introduction and motivation Analysis of Web robot traffic: Robot detection Performance Optimization: Predictive Caching Future research 2

Introduction and motivation Web robots are critical to many functions and services: Internet Search E-Business (shopbots) Private, Proprietary Systems Latest reports (Dec. 2013): over 60% of Web traffic! http://www.incapsula.com/blog/bot-traffic-report- 2013.html 3

Introduction and motivation Within the past 5 years: fundamental shifts in how the Web is used to communicate and share information Dynamic vs. static pages Users produce vs. consume information Subscriptions vs. searching Now, data on the Web has never been more valuable 25% of search results for the largest commercial brands are for usergenerated content 34% of bloggers post opinions about brands 78% of users trust peer recommendations over ads 80% of organizations incorporate social network data in recruitment practices Organizations seek to leverage this valuable, dynamic, time-sensitive data, to stay relevant 4

A New Web Economy 5

Introduction and motivation The volume and intensity of robot traffic will further grow over time! Web servers optimized only to service human traffic with very high performance Workload generation Predictive and proxy caching Optimal queuing, scheduling Unprepared to handle robot traffic - current knowledge of Web traffic may not transcend to robots! Objective: To perform a comprehensive analysis of Web robot traffic, and to prepare Web servers to handle robot requests with high performance 6

Outline Introduction and motivation Analysis of Web robot traffic Robot Detection Preparing Web Servers: Predictive Caching Future research 7

Robot detection Deficiency in state-of-the-art: focuses on finding commonalities across robot sessions Behavior changes over time, and from robot to robot Requirements for more accurate and reliable detection Find distinctions between robots and humans rather than commonalities between robots Root detection on a fundamental difference between human and robot behavior No matter how robots evolve, this difference remains Analytical, self-updateable model As behaviors change over time, so does the detection algorithm 8

Robot detection Fundamental difference: Session request pattern: The order in which resources are requested during a session Properties of human session request pattern: Governed by a Web browser Associated with site structure Target specific resources Properties of robot session request pattern: No governing interface Requests any resources, at any time May target very specific resources depending on functionality 9

Session request pattern Request patterns must be generic enough to characterize many different sessions in a similar way Partition resources into various classes Class text av prog compressed malformed no extension Extensions txt,xml,sty,tex,cpp,java html,asp,jsp,php,cgi,js png,tiff,jpg,ico,raw xls,,ppt,pdf,ps,dvi avi,mp3,wmv,mpg exe,dll,dat,msi,jar zip,rar,gzip,tar,gz,7z Req. strings not well-formed Request for dir. Contents 10

Detection Algorithm Encode session request patterns of robots and humans into two different discrete time Markov Chains (DTMCs) R = (s r, P r ) and H = R = (s r, P r ) Parameters estimated from logs Detection algorithm For an unlabeled session x = x 1, x 2,, x n Compute probability R or H generates x: log(pr x s r, P r ) = log x r 1 + i=2:n log P r x i 1,x i Label x as a robot if Pr x s r, P r > Pr(x s h, P h )

Datasets We consider data from one-year access logs over a variety of servers: Academic: University school of Engineering E-commerce: Univ. of Connecticut University bookstore Digital Archive: Online database of United States Public Opinion Information Millions of access logs across each Web server Using a heuristic approach, divided the logs into robot and human requests Test Set Date Robots Human Academic Mar 2011 4322 6121 Digital Archive Dec 2009 3752 1178 E-commerce Aug 2008 1419 556 12

DTMC Comparison (Behavior Fingerprints) R H Academic Digital Archive E-Commerce

Offline detection Performance evaluated using precision, recall and F1 Precision: true pos. count / true pos. + false pos. count Recall: true pos. count / true pos. + false neg. count F1: harmonic mean of precision, recall

Comparative Analysis Versus state-of-the-art results using various supervised learners

Real-time detection Offline detection is an `after-the-fact analysis Great for log processing; statistical analysis Damage survey Real-time detection catches robots in the act Differentiable treatment of robots and humans Control and handle crawling activities Damage control State-of-the-art methods offer an engineered solution Painful for the users (CAPTCHA) Complex server-side systems target specific classes of robot traffic Difficult to implement and maintain in practice 16

Real-time detection We can adopt our offline algorithm to run in real-time: 1. For every active session s, maintain Pr(s R); Pr(s H) 2. On new request, update Pr(s R), Pr(s H). 3. If number of requests is > k and the difference in log-probabilities exceeds a threshold Δ, classify. Parameter functions: k give Pr(s R), Pr(s H) chance to stabilize Δ tune tradeoff between reliability and need to classify Low Δ: We classify more sessions, but may be less accurate High Δ: Very confident classifications, but sessions may go unlabeled Choice of Δ depends on the Web server 17

Choices of Δ 0.5 < Δ < 2 offers broad degrees of confidence 18

Effect of k, Δ on sessions missed Academic Δ = 1.5; k > 6: ~ 20% of sessions go unclassified Note: Δ = 1.5 is very broad Ex: if Pr(s R) = 0.7, we require Pr(s H) < 0.173 before the log-probability difference exceeds Δ 19

Effect of k, Δ on sessions missed E-Commerce Δ = 1.1; k > 6: ~ 12% of sessions go unclassified 20

Effect of k, Δ on sessions missed Digital Library Δ = 1.1; k > 4: ~ 12% of sessions go unclassified 21

Real-time detection performance Academic Server Good results (F1 > 0.7 at k > 10) False positive rate pulls down F1 FP rate improves with larger requests processed 22

Real-time detection performance E-commerce Server Very strong results (F1 ~ 0.95 for k > 5) Decreasing accuracy for larger k For many requests, robots start to look like humans Balanced by very low FP rate 23

Real-time detection performance Digital Archive Server Great results (F1 > 0.8 for k > 12) Drop in FP rate for k > 12 Accuracy enhanced at k > 12 May be due to Web site structure: static home, log in pages 24

Robot detection Summary Offline detection Across a variety of distinct datasets, strong performance (Approx. F1 > 0.9; ~ 0.73 for Academic Web server) Improvement over state-of-the-art Real-time detection Very strong real-time capability, depending on domain (F1 > 0.75; ~ 0.95 for E-commerce) Decision can be made within a small number of requests (k > 12) Despite strict settings of Δ, low percentage of sessions go unclassified Variation in results across server domains! Interactions between site structure or content? Can this be incorporated in a resource request pattern model? 25

Outline Introduction and motivation Analysis of Web robot traffic: Robot detection Performance Optimization: Predictive Caching Future research 26

Web Caching Web server / cluster caching is a primary means to provide low latency, reduce network bottlenecks Caches store some resources in a smaller, faster, more expensive level of memory (RAM or controller vs. HDD) Very limited size, but very fast access Cache hit: Low-latency response Cache miss: High-latency response due to disk I/O; increases cluster bandwidth; ages Web server Caching polices dictate how and when resources are loaded into a cache 27

Web caching polices Numerous polices exist, built around simple heuristics: Least-recently-used (LRU): keep resources recently accessed in the cache [repeated requests] Log-size: Store as many resources as we can Popularity: Keep frequently requested resources Can we service robot requests with such rules? Robots Do not send repeated requests for same resource May specifically target resources of a given size Could favor different resources compared to humans Different behaviors Handle with separate caches Leverage our offline and real-time detector 28

Proposed Caching Architecture Server Access Logs Offline Detector Human Request Features Robot Request Features Real-time Request Stream Real-time Detector Robot Cache Human Cache Web Cache 29

Predictive robot caching policy Intuition: Detection demonstrated that the type of the next robot request is predictable Resource-based classification finds robots to favor a small number of resource types, captured in request sequences Characterizing robot resource popularity: power-law distribution Idea: Extract sequences of request types from robot sessions Predict type of the next resource Select resources to admit into cache based on frequency of requests within predicted type 30

Learning request sequences Request sequence: types of last n consecutive requests made in a robot session Prediction task: given the order and types of last n-1 requests, predict type of nth request Training Data Index Feature Vector Response (class label) i i+1 i+2 exe txt exe txt exe txt exe txt Stream of robot request types 31

Choosing a classifier NN, SVN, Mult. Log-regression: Only learns features of a request sequence (i has 3, 2, 3, 1 exe; 2 - subsequence) Does not correlate features across training data Nth-order Markov based models: Learns ordering of sequences (i has in pos. 1, i+1 has in pos. 1) High-order needed to capture rich features Elman Neural Network learns using both features and ordering Learns sequence features like a NN Uses layer of context nodes that integrates previously seen sequences throughout training process Feature Vectors exe txt i+2 exe txt i+1 exe i 33

Neural network training 1. Compute output of NN on feature vectors of training data (random initial weights) y 1 y 2 a 2 a 1 x 1 w 1 w 2 w 3 Pr() y 3 a 3 f. vector Label: Example Train seq: x 2 x 3 Pr(noE) Output = [Pr(), Pr(txt), Pr(), Pr(), Pr(av), Pr(prog), Pr(com), Pr(mal), Pr(noe) ] Truth = [ 1, 0, 0, 0, 0, 0, 0, 0, 0 ] Repeat for all training samples 34

Neural network training Define an error function that measures difference from Truth to Output J w = i=1 n c k=1 t ik ln z ik t ik : target output of training sample i at index k z ik : predicted output of training sample i at index k w: network weights learned through training Minimize J w.r.t. each weight w by simultaneously minimizing all partial derivatives J/ w Use stochastic gradient descent to approximate computationally Run network with new weights w, compute new J, re-optimize w Repeat until convergence: J w i 1 J w i < δ 35

Elman neural network training Elman NN Twist: hidden units save state to context units Weight from hidden to context = 1 Weights from context to hidden: additional parameters Pr() Feature Vectors exe txt i+2 exe txt i+1 exe i Pr() Pr() Pr(noe) Pr(comp) Pr(mal) Pr(av) Pr(prog) Pr(txt) 36

Elman neural network training Pr() Feature Vectors exe txt i+2 exe txt i+1 State of network from previous sequence considered in next training iteration Pr() Pr() Pr(noe) Pr(comp) Pr(mal) Pr(av) Pr(prog) Pr(txt) 37

Elman neural network training Feature Vectors exe txt i+2 Pr() Pr() Pr() Pr(noe) Pr(comp) Pr(mal) Pr(av) Pr(prog) Pr(txt) State of network from previous sequences considered in next training iteration 38

Network training and validation Robot Request Stream 362,390 sequences total 01/Aug/2009 Training & validation (40%) Sequences of size k=10 02/Sep/2009 First 40% of requests used to find best # of hidden units for ENN 10-fold cross-validation Test (60%) Evaluate ENN prediction accuracy on rest of data; compare results against many other multinomial predictors 17/Sep/2009 39

Fitting neural network size 40

Comparison of classifiers We compare the classification accuracy of ENN against other typical multinomial classifiers DTMC (learning only by sequence order): Multinomial Logistic Regression (learning only features): Random guess (Correct 1/9 times) Model Accuracy Gain-RG Gain-MLR Gain-DTMC RG 0.111 - - - MLR 0.338 67.16% - - DTMC 0.392 71.68% 16.0% - ENN 0.647 82.84% 47.8% 39.4% Order in request sequences may be a stronger predictor compared to features 41

Robot caching policy After predicting request type, admit the most frequently requested resources within that type into the cache Power-law popularity in robot requests: most frequently requested resources are fetched much more often than others If all resources of a type fit in cache, load popular resources of the 2 nd most likely type Repeat until cache is at capacity 42

Robot caching policy unk txt Types of most recent requests Multinomial Predictor prog txt exe unk txt txt av txt txt Emit probabilities of next request type oth unk unk Web Web Web Web Web Web Web Web Web Web Resources of each type ordered by popularity Web Load cache to capacity with most likely types History of request type sequences Robot Cache Web Cache 43

Caching Performance Compared performance (hit-ratio) of our predictive policy over robot traffic versus suite of baseline polices Log-size: Store smallest resources; maximize # of resources in cache LRU: Store most recently requested resources, evicting oldest resources Popularity: Evict resources requested least frequently Hyper-G: Evict resources requested least frequently, break ties using LRU Popularity-based caches generally used in practice 44

Caching Performance Policy 1MB 2MB 3MB 4MB 5MB 8MB 12MB 20MB 40MB Log-size.055.056.057.057.057.058.058.059.059 LRU.111.126.136.141.145.153.159.165.175 Hyper-G.174.178.172.180.176.188.189.212.236 Pop.192.204.206.205.205.205.223.224.282 ENN.185.199.212.220.228.258.284.335.425 ENN-Gain -3.4% -2.5% 3.78% 6.82% 10.1% 20.5% 21.5% 33.1% 33.6% Note that improvement in hit-ratio grows just logarithmically with cache size Small % improvement equivalent to using a worse policy with an exponentially (cost-prohibitive) larger cache ENN performance grows even stronger with larger cache size 45

Outline Introduction and motivation Analysis of Web robot traffic: Robot detection Performance Optimization: Predictive Caching Future research 46

Future Research Automated robot classification Taxonomy of robot times for finer-grained detection Workload generation Methods that generate representative streams of intertwined robot and human traffic Predictive caching Extension of preliminary results Implementation of real caching algorithm Very exciting work going on here! 47

Thank you for your attention! Questions?