Automated Application Signature Generation Using LASER and Cosine Similarity

Similar documents
A Hybrid Approach for Accurate Application Traffic Identification

Automated Classifier Generation for Application- Level Mobile Traffic Identification

Internet Traffic Classification Using Machine Learning. Tanjila Ahmed Dec 6, 2017

Automated Traffic Classification and Application Identification using Machine Learning. Sebastian Zander, Thuy Nguyen, Grenville Armitage

Developing the Sensor Capability in Cyber Security

Gene Clustering & Classification

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Traffic Classification Using Visual Motifs: An Empirical Evaluation

Intelligent Edge Computing and ML-based Traffic Classifier. Kwihoon Kim, Minsuk Kim (ETRI) April 25.

Detecting Malicious Hosts Using Traffic Flows

A Hybrid Approach for Accurate Application Traffic Identification

Scalability of the BitTorrent P2P Application

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

FRM: An Alternative Approach to Accurate Application Traffic Identification

Design and Implementation of Virtual TAP for Software-Defined Networks

On the challenges of network traffic classification with NetFlow/IPFIX

Software Architecture for a Lightweight Payload Signature-Based Traffic Classification System

Assessing the Nature of Internet traffic: Methods and Pitfalls

Introduction. Can we use Google for networking research?

OpenAPI-based Message Router for Mashup Service Development

An advanced data leakage detection system analyzing relations between data leak activity

FPGA based Network Traffic Analysis using Traffic Dispersion Graphs

A Method on Multimedia Service Traffic Monitoring and Analysis 1

Polymorphic Blending Attacks. Slides by Jelena Mirkovic

Mosaic: Quantifying Privacy Leakage in Mobile Networks

Can t you hear me knocking

Keywords Machine learning, Traffic classification, feature extraction, signature generation, cluster aggregation.

Internet Traffic Classification using Machine Learning

Enhancing Byte-Level Network Intrusion Detection Signatures with Context

Early Application Identification

Detecting malware even when it is encrypted

Graph-based Detection of Anomalous Network Traffic

ECLT 5810 Clustering

A Hybrid Neural Model for Type Classification of Entity Mentions

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Flowzilla: A Methodology for Detecting Data Transfer Anomalies in Research Networks. Anna Giannakou, Daniel Gunter, Sean Peisert

90 Minute Optimization Life Cycle

QPROBE: DETECTING THE BOTTLENECK IN CELLULAR COMMUNICATION

Online Traffic Classification Based on Sub-Flows

liberate, (n): A library for exposing (traffic-classification) rules and avoiding them efficiently

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Network Traffic Measurements and Analysis

Identify P2P Traffic by Inspecting Data Transfer Behaviour

Diagnosing Wireless Packet Losses in : Collision or Weak Signal?

Outline. Morning program Preliminaries Semantic matching Learning to rank Entities

From Traffic Measurement to Realistic Workload Generation

SCREAM: Sketch Resource Allocation for Software-defined Measurement

Ranking Algorithms For Digital Forensic String Search Hits

Information Retrieval and Organisation

ECLT 5810 Clustering

Network delay estimation to remote locations based on passive RTT measurements A digest of recent related works at Salzburg Research

A System for Real-time Detection and Tracking of Vehicles from a Single Car-mounted Camera

Stochastic Blockmodels as an unsupervised approach to detect botnet infected clusters in networked data

NMLRG #4 meeting in Berlin. Mobile network state characterization and prediction. P.Demestichas (1), S. Vassaki (2,3), A.Georgakopoulos (2,3)

ndpi & Machine Learning A future concrete idea

Can t You Hear Me Knocking: Security and Privacy Threats from ML Application to Contextual Information

Efficient Payload Signature Structure for Performance Improvement of Traffic Identification

Entropy estimation for real-time encrypted traffic identification

YouLighter: An Unsupervised Methodology to Unveil YouTube CDN Changes

CS5670: Computer Vision

Detecting malware even when it is encrypted

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Studying the Impact of Text Summarization on Contextual Advertising

1 A Local Area Network (LAN) consists of four computers and one server. The LAN uses a bus topology.

BitTorrent Traffic Classification

James Won-Ki Hong. Distributed Processing & Network Management Lab. Dept. of Computer Science and Engineering POSTECH, Korea.

Pattern recognition (4)

Heavy-Hitter Detection Entirely in the Data Plane

TMA workshop 2010 April

CAP 6412 Advanced Computer Vision

Polygraph: Automatically Generating Signatures for Polymorphic Worms

RAPTOR: Routing Attacks on Privacy in Tor. Yixin Sun. Princeton University. Acknowledgment for Slides. Joint work with

Keywords Traffic classification, Traffic flows, Naïve Bayes, Bag-of-Flow (BoF), Correlation information, Parametric approach

Identifying and Discriminating Between Web and Peer-to-Peer Traffic in the Network Core

Identifying Traffic Differentiation in Mobile Networks

Experimental Networking Research and Performance Evaluation

Clustering. k-mean clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Software-Defined Networking (Continued)

Application Identification Based on Network Behavioral Profiles

A Multi-agent Based Cognitive Approach to Unsupervised Feature Extraction and Classification for Network Intrusion Detection

White Paper: Next-Gen Network Traffic Analysis (NTA): Log-based NTA vs. Packet-based NTA

Network Traffic Measurements and Analysis

Information, Gravity, and Traffic Matrices

Deep Learning for Computer Vision

Wonsup Lee, Kihyo Jung, Jangwoon Park, Sujin Kim, Sunghye Yoon, Moonsung Kim, and Heecheon You

A New Platform NIDS Based On WEMA

Developing MapReduce Programs

You submitted this homework on Sat 22 Mar :51 AM PDT. You got a score of 5.00 out of 5.00.

TTIC 31190: Natural Language Processing

Network traffic classification: From theory to practice

Understanding Online Social Network Usage from a Network Perspective

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring

Search Engines. Information Retrieval in Practice

Learning Binary Code with Deep Learning to Detect Software Weakness

Data Informatics. Seon Ho Kim, Ph.D.

Coverage Metrics for Functional Validation of Hardware Designs

Detecting Malicious Activity with DNS Backscatter Kensuke Fukuda John Heidemann Proc. of ACM IMC '15, pp , 2015.

2. Design Methodology

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Transcription:

Automated Application Signature Generation Using LASER and Cosine Similarity Byungchul Park, Jae Yoon Jung, John Strassner *, and James Won-ki Hong * {fates, dejavu94, johns, jwkhong}@postech.ac.kr Dept. of Computer Science and Engineering, POSTECH, Korea * Division of IT Convergence Engineering, POSTECH, Korea April 24, 2010 The 3 rd CAIDA-WIDE-CASFI Joint Measurement Workshop

Introduction Traffic classification based on flow similarity Research goal Overview of proposed methodology Vector space modeling Measuring packet/flow similarity Evaluation Result What is next step? Fine-grained traffic classification Automated application signature generation using LASER and flow similarity Conclusion Contents 2

Introduction Internet traffic classification gains continuous attentions CAIDA have created a structured taxonomy of traffic classification papers and their data set (68 papers, 2009) Various methodologies for traffic classification Accuracy Strength Weakness Port-based Low Low computational cost Low accuracy Signaturebased High Most accurate method Exhaustive signature generation ML-based High Can handle encrypted traffic High complexity Affected by network condition How can we guaranty the classification accuracy with low complexity? Develop a methodology to generate application signature automatically Develop another methodology using packet payload contents 3

Traffic classification based on flow similarity Research goal: a new traffic classification methodology Analyzing payload contents High accuracy and low complexity Document classification Traffic classification Document classification in natural language processing Document Packet (or traffic) Apply a variation of document classification approach to traffic classification Low processing overhead Comparable accuracy to signature-based classification No more exhaustive signature extraction tasks Simple numerical representation of similarity between network traffic 4

Overview of Proposed Methodology Packet Similarity 5

Vector Space Modeling (1/2) An algebraic model representing text document as vectors Widely used in document classification research Payload vector conversion Document classification in natural language processing Document Packet (or traffic) Document classification utilize occurrence Definition of word in payload Payload data within an i-bytes sliding window Word set = 2 (8*sliding window size) Definition of payload vector A term-frequency vector in NLP Payload Vector = [w 1 w 2 w n ] T 6

Vector Space Modeling (2/2) The word size is 2 and the word set size is 2 16 Larger word size dimension of payload vector is increased exponentially 7

Measuring Packet Similarity Cosine Similarity The most common similarity metric in NLP Similarity (p 1, p 2 ) = V(p 1 ) V(p 2 ) V(p 1 ) V(p 2 ) 0: Independent 1: Exactly same Packet Comparison Packet similarity = Cosine Similarity (payload_vector 1, payload_vector 2 ) 0: Payloads are different 1: Payloads are similar 8

Measuring Flow Similarity Payload Flow Matrix (PFM) k payload vectors in a flow Represent a traffic flow PFM = [p 1 p 2 p k ] T where p i is payload Collected PFM Information about target flows Alternative signatures Accumulated empirically to enhance signature word Collected PFMs = a * new PFM + (1 - a) * Collected PFMs PFM 1 PFM 2 PFM 3 PFM m Packets are compared sequentially with only the corresponding packet in the other flow Flow similarity score = packet similarity 9

Measuring Packet Similarity Dataset: traffic trace on one of two Internet junction at POSTECH Traffic Measurement Agent (TMA) Monitoring the network interface of the host Recording log data (5-tuple flow info., process name, packet count, etc) Generating ground-truth to validate traffic classification results 10

Classification Results 100 80 60 40 Classification Accuracy (%) BitTorrent LimeWire Fileguri Youtube Application Classified Traffic (kb) False Negative (kb) False Positive (kb) BitTorrent 202,018 3,361 0 LimeWire 87,678 2,951 0 FileGuri 95,804 9,691 0 YouTube 16,061 0 3,775 TMA Log Traffic 421,339 kb GET / HTTP/1.1 HTTP packet contents User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-us) Connection: Keep-Alive YouTube signal packet contents GET/videoplayback?sparams=id%2Cexprie %2Cip%2ipbits% HTTP/1.1 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-us) Connection: Keep-Alive 11

Proposed Method vs. LASER Accuracy comparison with our earlier work (LASER, automated signature generation system) Proposed Method LASER Overall Accuracy 96.01% 97.93% 15.00 11.25 7.50 3.75 0 Proposed Method LASER BitTorrent LimeWire Fileguri 12

Summary New traffic classification approach Converting payloads into vector representations Document classification approach to traffic classification Accuracy analysis on representative target applications in the real traffic Contribution No more exhaustive search for payload signatures Achieving simplicity simple numerical representation of similarity in traffic classification Strength Accuracy of classification result was almost same with signature-based classification result (overall accuracy: 96%) Similar to unsupervised ML (clustering) with low complexity Weakness Manual parameter adjustment Scalability problem (efficient for small number of target application) Vector and matrix conversion are required 13

What is Next Step? Fine-grained traffic classification Current traffic classification schemes are only able to discriminate broad application classes or application names Current Scheme Application #1 Usage #1 Traffic Traffic Classification System Application #2 Application #3 Usage #2 Usage #3 One application generates different types of traffic (e.g., P2P: searching, downloading, advertising, messenger, etc) Fine-grained traffic classification can be used for extracting information about application usage Need a new methodology to classify certain application s traffic according to usage of the traffic 14

Proposing New Approach LASER + Flow similarity Stage 1: Preprocess network traffic using flow similarity to classify usage types of traffic Stage 2: Extract application signatures from flows which are grouped by flow similarity Types of traffic generated by a network application (especially P2P app.) are limited Flow similarity might efficient for classifying types of network flow (without scalability problem) Combining two methods can enable to generate application signature fully automated manner 15

Conclusion Traffic classification using flow similarity Converting payloads into vector representations Utilizing document classification approach to traffic classification Provide soft-classification that is represented as a numerical value ranges from 0 to 1 Provide about 95 % classification result regardless of asymmetric routing environment Linear time complexity Fine-grained traffic classification Goal: Develop a methodology to classify certain application s traffic according to usages of the traffic Fine-grained traffic classification can be used for extracting information about application usage Top n applications Top n operations Approach: combining LASER and document classification methodologies 16

Q&A 17