Inferring Protocol State Machine from Network Traces: A Probabilistic Approach

Similar documents
Inferring Protocol State Machine from Network Traces: A Probabilistic Approach

Automated Extraction of Network Protocol Specifications

Prospex: Protocol Specification Extraction

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A New Technique to Optimize User s Browsing Session using Data Mining

Network Traffic Measurements and Analysis

Association Pattern Mining. Lijun Zhang

"GET /cgi-bin/purchase?itemid=109agfe111;ypcat%20passwd mail 200

Leveraging Set Relations in Exact Set Similarity Join

Message Format and Field Semantics Inference for Binary Protocols Using Recorded Network Traffic

Flow-based Anomaly Intrusion Detection System Using Neural Network

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Hierarchical Document Clustering

Big Data Analytics for Host Misbehavior Detection

Automatic Extraction of Road Intersection Position, Connectivity and Orientations from Raster Maps

Model-Driven Matching and Segmentation of Trajectories

Template Extraction from Heterogeneous Web Pages

Frequent Pattern Mining with Uncertain Data

GSCS Graph Stream Classification with Side Information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Unsupervised Semantic Parsing

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Understanding policy intent and misconfigurations from implementations: consistency and convergence

Intrusion Detection and Malware Analysis

Automated Traffic Classification and Application Identification using Machine Learning. Sebastian Zander, Thuy Nguyen, Grenville Armitage

Dynamic Time Warping

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Performance Based Study of Association Rule Algorithms On Voter DB

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

Web Usage Mining: A Research Area in Web Mining

Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

Closest Keywords Search on Spatial Databases

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Internet Research Task Force (IRTF) Category: Informational May 2011 ISSN:

A mining method for tracking changes in temporal association rules from an encoded database

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces

Mining High Order Decision Rules

Machine Learning. Unsupervised Learning. Manfred Huber

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Table of Contents 1 Introduction A Declarative Approach to Entity Resolution... 17

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Optimal Cache Allocation for Content-Centric Networking

Multiresolution Motif Discovery in Time Series

Applying Packets Meta data for Web Usage Mining

4. Ad-hoc I: Hierarchical clustering

Towards New Heterogeneous Data Stream Clustering based on Density

Combining Selective Search Segmentation and Random Forest for Image Classification

Feature Selection in Learning Using Privileged Information

Inferring the Source of Encrypted HTTP Connections

Limsoon Wong (Joint work with Mengling Feng, Thanh-Son Ngo, Jinyan Li, Guimei Liu)

MULTI - KEYWORD RANKED SEARCH OVER ENCRYPTED DATA SUPPORTING SYNONYM QUERY

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

International Journal of Software and Web Sciences (IJSWS) Web service Selection through QoS agent Web service

Hierarchical Clustering of Process Schemas

Clustering Part 3. Hierarchical Clustering

Using Visual Motifs to Classify Encrypted Traffic

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Anonymization of Network Traces Using Noise Addition Techniques

Supervised Web Forum Crawling

Entity Resolution, Clustering Author References

Towards Rule Learning Approaches to Instance-based Ontology Matching

A Novel Method for Activity Place Sensing Based on Behavior Pattern Mining Using Crowdsourcing Trajectory Data

Part I: Data Mining Foundations

Collective Entity Resolution in Relational Data

Clustering the Internet Topology at the AS-level

Internet Traffic Classification Using Machine Learning. Tanjila Ahmed Dec 6, 2017

Can t you hear me knocking

CS395T paper review. Indoor Segmentation and Support Inference from RGBD Images. Chao Jia Sep

Merging Data Mining Techniques for Web Page Access Prediction: Integrating Markov Model with Clustering

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Record Linkage using Probabilistic Methods and Data Mining Techniques

Netzob Documentation. Release Frédéric Guihéry, Georges Bossert

Modeling and Analyzing 3D Shapes using Clues from 2D Images. Minglun Gong Dept. of CS, Memorial Univ.

ECLT 5810 Clustering

Association Rule Mining. Introduction 46. Study core 46

Polygraph: Automatically Generating Signatures for Polymorphic Worms

Clustering. Chapter 10 in Introduction to statistical learning

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Defining a Better Vehicle Trajectory With GMM

Outline. Motivation. Our System. Conclusion

Attack Patterns Recognition Framework

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

Dynamic Embeddings for User Profiling in Twitter

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Chapter 2. Related Work

CSE 573: Artificial Intelligence Autumn 2010

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Automated Tagging for Online Q&A Forums

Transcription:

Inferring Protocol State Machine from Network Traces: A Probabilistic Approach Yipeng Wang, Zhibin Zhang, Danfeng(Daphne) Yao, Buyun Qu, Li Guo Institute of Computing Technology, CAS Virginia Tech, USA P25-1

Agenda Motivation Challenges Architecture of Veritas Packet Analysis State Message Inference State Machine Inference Experimental Evaluation Conclusions Future Works 2

Motivation 3 Protocol specs is useful in many security applications Traffic classification IDS & DPI Botnet detection Previous works are major in reverse engineering Time consuming & Error prune Codes are not always available Problem: Can we infer protocol specs from traffic automatically, if they are not encrypted?

Challenges How to discover protocol keywords (most frequent strings) from traffic traces? Traffic classification needs keywords to label flows Naïve solution: sequence alignment algorithms (Needlemen-Wuch, DTW ) Not scalable Can not handle multiple keywords Our solution: K-S Filtering 4

Challenges How to obtain protocol state machine from traffic traces? State machine is the model of protocol grammar Botnet behavior description What is a state How many states State labeling within a flow Our solution: Clustering + P-PSM 5

Architecture of Veritas Assumption 1: Traffic is not encrypted For performance concern many applications do not encrypted their traffic Assumption 2 (Single- Protocol Inference): The trace is only composed of flows from the protocol to be investigated Single-protocol inference can be basic work for multiple case 6

Packet Analysis Message Unit Extraction Break flow data into subsequences Count the frequency precisely Problems Length of subsequences: l Length of protocol-related sequences: n 7

K-S Test Filtering What is K-S test Try to determine if two datasets differ significantly Making no assumption about the distribution of data How to do K-S test Building CDF for A & B Building K-S statistic & test 8

Protocol Format Messages Inference Choosing candidate units with K-S filter Select nontrivial units from noises Combine message units Link nontrivial units Get protocol format messages 9

Protocol State Message Inference What is a state? Intuition: Messages with same format share the same interpretations State: Message with distinguishable format Example: EHLO EHLO User EHLO localhost.localdomain How to distinguish a state from others Jaccard index based distance 10

Protocol State Message Inference EHLO User EHLO EHLO localhost.localdomain Using clustering to distinguish states Using Medoid algorithm to cluster How many states (How many clusters): k Using Dunn index to measure clustering quality and get the best k Labeling each packet with a state according to its cluster center 11

State Machine Inference The true state machine Defined in documents Implemented in applications The probabilistic state machine Inferring from traffic traces Data dependent Time series mining based approach 12

State Machine Inference Step 1: State labeling within a flow Step 2: Calculating frequency of each state pair and filtering them with a frequency filter 13

State Machine Inference Step 3: Depicting the linkage of each state pair with a directed labeled graph Step 4: Building the probabilistic protocol state machine (P-PSM) 14

State Machine Inference Step 5: Merging states with same input and output 15

Experimental Evaluation Text Protocol Using ASCII printable characters as protocol format SMTP l: 3 n: 12 k: 12 16

Experimental Evaluation Binary Protocol PPLive l: 3 n: 12 k: 12 17

Experimental Evaluation Binary Protocol XUNLEI l: 3 n: 12 k: 10->5 18

Experimental Evaluation Metric: Completeness (Coverage) Using new flows to pass the inferred machine SMTP flows: 100K 86% PPLive UDP packets: 20K 100% Xunlei UDP packets: 50K 99% 19

Conclusions Veritas: A system that can infer protocol state machine solely from traffic K-S filtering based protocol format extraction approach Clustering based protocol state labeling approach within a flow P-PSM: A probabilistic approach to infer protocol state machine 20

Future works Limitations Data dependent Several parameters Future works Language model based protocol specs inference Semantic inference Multi-protocol inference 21

Gracias! 22

Combine Message Units Input: Φ MAI FRO AIL ROM IL_ OM: L_F. _FR Output: FIND!! Units Combiner DROP!! M A I L F R O M : < y 23 M A I L F R O M : Back

Choosing candidate with K-S filter MAI _FR M:< MAI _ FR M:_ Input: AIL FRO :<y AIL FRO :_p IL_ L_F ROM OM:. Dataset1 Drop! IL_ L_F ROM OM:. Dataset2 K-S Test Filter Thresholdλ 24 Output: Φ MAI AIL IL_ FRO ROM OM: L_F. _FR Candidate Message Units Back

Break data into subsequences M A I L F R O M : < y M A I A I L M : < : < y I L L F F R O M : R O M 25 F R O Back