Graph-based Detection of Anomalous Network Traffic

Do Quoc Le
Supervisor: Prof. James Won-Ki Hong
Distributed Processing & Network Management Lab
Division of IT Convergence Engineering, POSTECH, Korea
lequocdo@postech.ac.kr
2012. 06. 22
POSTECH 1/26

Contents
- Introduction & Motivation
- Related Work
- Graph-based Network Traffic Modeling
- Graph Metrics
- Anomaly Detection & Attack Identification
- Validation
- Conclusion

Introduction & Motivation
- The Internet continues to grow in size and complexity, and security has become a critical issue.
- Traffic anomalies occur frequently: DDoS attacks, flash crowds, port scans, and worms.
- Challenges:
  - Attacks are increasingly sophisticated.
  - Attacks are often hidden in existing applications such as IRC, HTTP, or peer-to-peer, e.g. worm scans or botnet C&C traffic.
- Methods for detecting traffic anomalies:
  - Signature-based techniques
    - Cannot detect anomalies caused by unknown attacks.
  - Anomaly-based techniques (machine learning, data mining, statistical analysis, etc.)
    - Generate a huge number of false alarms.
    - Are time consuming.
    - Cannot detect anomalies whose traffic resembles that of normal applications in traffic volume, number of packets, number of flows, and average packet size.

Introduction & Motivation
- Goal: improve the detection accuracy and capability of state-of-the-art anomaly detection techniques.
- Solution: use a graph-based method to monitor network traffic and analyze the structure of communication patterns in order to detect anomalies and identify attacks.
- Why study the structure of communication patterns in network traffic?
  - Each attack has its own structure.
  - The structure of communication patterns changes when attacks occur.
  - Structural analysis can reveal attacks that are difficult to detect by conventional means.

Contribution
- One of the first works to use Traffic Dispersion Graphs (TDGs) to detect anomalies
  - Focuses on structural characteristics of networks.
  - Improves the performance and capability of state-of-the-art techniques.
  - Supports intuitive visualization of traffic patterns.
- Introduces a new metric for analyzing network traffic communication patterns over time
- Implements an online anomaly detection system in an enterprise network based on the proposed method
- Evaluates the approach by analyzing real attack traces

Related Work
- Zhou et al. [1] proposed a network traffic anomaly detection method based on graph mining
  - Mining time-series graphs.
  - Mining edge weights.
  - Entropy of four attributes: source and destination IP address, source and destination port.
  - Drawback: enormous size and computational complexity.
  - In contrast, we analyze unlabeled graphs and concentrate only on their nodes.
- Godiyal et al. [2] used a graph matching method to identify attacks
  - Applying an isomorphism algorithm to the whole traffic flow is very time consuming.
  - In contrast, we identify attacks in abnormal network traffic only.

Related Work (cont.)
- Iliofotou et al. [3] use TDGs to model network traffic as a series of related graphs over time
  - Graph metrics used:
    - Degree, degree distribution
    - Entropy of degree distribution
    - Graph edit distance
  - Addresses the problem of traffic classification, with possible application to anomaly detection.
  - We model network traffic as TDGs over time using new metrics.

Network Traffic Modeling
- Traffic Dispersion Graph (TDG)
  - Each node is an IP address.
  - Each edge is an interaction (flow) between two nodes.
[Figure: generated TDG with nodes A, B-1, B-2, D-1, D-2, F-1]
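
The node and edge definitions above can be sketched in a few lines of Python. This is only an illustration: it assumes each flow record has already been reduced to a (source IP, destination IP) pair and that edges are directed from source to destination. The function name `build_tdg` and the sample addresses are hypothetical and do not come from the talk.

```python
def build_tdg(flows):
    """Build a TDG from (src_ip, dst_ip) flow pairs.

    Returns (nodes, edges): nodes is a set of IP addresses, edges is a set
    of directed (src_ip, dst_ip) pairs. Repeated flows between the same two
    hosts collapse into one edge (an assumption of this sketch).
    """
    nodes, edges = set(), set()
    for src, dst in flows:
        nodes.add(src)
        nodes.add(dst)
        edges.add((src, dst))
    return nodes, edges


# Hypothetical flows: one host contacting three others, one flow repeated.
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.4"),
]
nodes, edges = build_tdg(flows)
print(len(nodes), len(edges))  # 4 3
```

Keeping edges in a set makes the graph unweighted, which matches the focus on structure rather than traffic volume.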

TDG Visualization
- HTTP
  - Many disconnected components.
  - Very few nodes with both in- and out-degree (web proxies?).
- Slammer worm (UDP, destination port 1434)
  - Many high out-degree nodes.
  - Many disconnected components.
  - The majority of nodes have only in-degree (nodes being scanned).
Source: Iliofotou et al.

Graph Metrics on TDGs
- What we have seen so far:
  - Visualization is useful by itself.
  - However, it requires a human operator.
- Next step: translate visual intuition into quantitative measures.
- How can properties of TDGs be characterized quantitatively?
  - Step 1: represent traffic as a sequence of graph snapshots $G_{t_0}, G_{t_1}, G_{t_2}, \ldots, G_{t_n}$ over time.
  - Step 2: use metrics that quantify differences between graphs.
- What are the differences in communication structure between two snapshots $G_x$ and $G_y$?

Graph Metrics on TDGs
- Static metrics
  - Node degree
    - In-degree
    - Out-degree
  - Degree distribution
    - Shows an approximate power law.
  - Maximum degree (Kmax)
    - One of the metrics used to detect DDoS attacks.
  - Degree assortativity
    - Measures the tendency of nodes to be connected to nodes of similar degree.
  - Entropy of the degree distribution
    - Quantifies the heterogeneity of the network:
      $H(X) = -\sum_{k=1}^{k_{\max}} P(k) \log P(k)$
      where $P(k)$ is the probability that a node has degree $k$.
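
As a rough illustration of two of the static metrics above, the sketch below computes Kmax and the entropy of the degree distribution for a small edge list. It makes several assumptions not stated on the slide: degree is taken as total degree (in plus out), the entropy carries the usual negative sign of Shannon entropy, and the natural logarithm is used. All function names are hypothetical.

```python
import math
from collections import Counter


def node_degrees(edges):
    """Total degree (in + out) of every node in a set of directed edges."""
    deg = Counter()
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    return deg


def k_max(edges):
    """Maximum node degree in the graph (0 for an empty graph)."""
    return max(node_degrees(edges).values(), default=0)


def degree_entropy(edges):
    """Entropy of the degree distribution, -sum_k P(k) log P(k)."""
    deg = node_degrees(edges)
    n = len(deg)
    if n == 0:
        return 0.0
    # P(k) = fraction of nodes whose degree is exactly k.
    nodes_per_degree = Counter(deg.values())
    return -sum((c / n) * math.log(c / n) for c in nodes_per_degree.values())


# Hypothetical star: one hub contacting four leaves.
star = {("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")}
print(k_max(star))  # 4
print(degree_entropy(star))
```

A graph in which every node has the same degree has entropy 0; the more varied the degrees, the higher the value.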

Graph Metrics on TDGs
- Dynamic metrics
  - Graph edit distance
    $d(G_i, G_j) = |V_i| + |V_j| - 2|V_i \cap V_j| + |E_i| + |E_j| - 2|E_i \cap E_j|$
    where $|V_i|, |E_i|$ and $|V_j|, |E_j|$ are the numbers of nodes and edges in graphs $G_i$ and $G_j$, respectively.
  - dk-2 distance metric
    - Based on the dk-series concept.
    - Structure analysis with the dk-n series (n = 1, 2, 3, ...): looks at inter-dependencies among topology characteristics.
    - The dk-n series are degree correlations within simple connected graphs of size n.
    - dk-2 describes the joint node degree distribution.
    - dk-2 distance(G, G') = Euclidean distance between dk-2(G) and dk-2(G').
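
The two dynamic metrics can be sketched as follows. This is one possible reading, not the talk's implementation: the graph edit distance is interpreted with set intersections of nodes and edges, and dk-2 is represented as a count of edges per (degree, degree) pair on the undirected version of the graph, without normalization. The function names are hypothetical.

```python
import math
from collections import Counter


def graph_edit_distance(nodes_i, edges_i, nodes_j, edges_j):
    """|Vi| + |Vj| - 2|Vi & Vj| + |Ei| + |Ej| - 2|Ei & Ej| on Python sets."""
    return (len(nodes_i) + len(nodes_j) - 2 * len(nodes_i & nodes_j)
            + len(edges_i) + len(edges_j) - 2 * len(edges_i & edges_j))


def dk2(edges):
    """Joint degree distribution: number of edges per (k1, k2) degree pair.

    Edges are treated as undirected and self-loops are dropped
    (both are assumptions of this sketch).
    """
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    deg = Counter()
    for e in undirected:
        for node in e:
            deg[node] += 1
    joint = Counter()
    for e in undirected:
        u, v = tuple(e)
        joint[tuple(sorted((deg[u], deg[v])))] += 1
    return joint


def dk2_distance(edges_a, edges_b):
    """Euclidean distance between the dk-2 counts of two graphs."""
    a, b = dk2(edges_a), dk2(edges_b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))


# Hypothetical snapshots: a 3-node path versus a 4-node star.
path = {("a", "b"), ("b", "c")}
star = {("h", "x"), ("h", "y"), ("h", "z")}
print(dk2_distance(path, star))
```

Because dk-2 depends only on degrees, two snapshots with entirely different hosts but the same structure have distance 0, whereas the graph edit distance would be large.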

Anomaly Detection & Attack Identification
- Graph metrics are used to detect abnormal network traffic.
- Anomalies: attacks that change the communication structure of the network (DDoS attacks, Internet worms, and scanning).
- The overall process consists of two parts: anomaly detection and attack identification.
Figure 4. Overall detection process (Network Traffic Flow, Anomaly Detection, Attack Identification, Alarm).

Anomaly Detection & Attack Identification
- Anomaly detection
  - Step 1: Sample network traffic and generate network flows.
  - Step 2: Create a TDG (DOT format) from the network flows in each sampling interval.
  - Step 3: Compute the adjacency matrix of the TDG and its graph metrics.
  - Step 4: Compare the value of each graph metric with its threshold.
    - Graph metric value < threshold: normal TDG.
    - Graph metric value > threshold: abnormal TDG.
Figure 5. Detailed anomaly detection process.
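
Steps 2 and 4 above might look roughly like the sketch below. It is a sketch under stated assumptions: a snapshot is flagged as abnormal when any single metric exceeds its threshold, a value exactly equal to the threshold counts as normal, and the threshold numbers are made up for illustration. The names `to_dot` and `classify_snapshot` are hypothetical.

```python
def to_dot(edges, name="TDG"):
    """Serialize directed (src, dst) edges as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for src, dst in sorted(edges):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


def classify_snapshot(metrics, thresholds):
    """Compare each metric of one TDG snapshot with its threshold.

    Returns ("abnormal", [names of exceeded metrics]) if any metric is
    above its threshold, otherwise ("normal", []).
    """
    exceeded = sorted(name for name, value in metrics.items()
                      if value > thresholds[name])
    return ("abnormal" if exceeded else "normal"), exceeded


# Illustrative thresholds only; real ones would come from training traffic.
thresholds = {"kmax": 100, "dk2_distance": 500}
print(classify_snapshot({"kmax": 12, "dk2_distance": 40}, thresholds))
print(classify_snapshot({"kmax": 950, "dk2_distance": 40}, thresholds))
print(to_dot({("10.0.0.1", "10.0.0.2")}))
```

Returning the list of exceeded metrics alongside the label lets an operator see which structural property triggered the alarm.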

Anomaly Detection & Attack Identification
- Attack identification
  - Attack pattern:
    Figure 7. Attack pattern generation process.
    Figure 8. DDoS attack pattern in DDoS CAIDA trace.
    Figure 9. Peacomm P2P botnet pattern.
  - Attack identification:
    Figure 11. Attack identification process.

Validation
- Off-line analysis
  - DARPA 1999 dataset
    - Weeks 1 and 3: no attacks (training data).
    - Week 2: 43 attacks belonging to 18 labeled attack types, used for system development.
    - Weeks 4 and 5: 201 attacks belonging to 58 attack types (including 40 new attacks).
  - POSTECH trace, 2009. 7. 9.
    - Contains the famous DDoS attack of July 7, 2009 in South Korea.
  - CAIDA DDoS trace from 2007.
  - P2P botnet (Peacomm) trace from a honeynet.
- On-line analysis
  - Real-time anomaly detection
  - Tested with a port scanning attack

Validation (DARPA dataset)
- DARPA 1999 dataset
Figure 12. Kmax per minute over one day (Monday, Week 5) with normal and attack traffic.
Figure 13. dk-2 distance per minute over one day (Monday, Week 5) with normal and attack traffic.

Validation (DARPA dataset)
- DARPA 1999 dataset

Table 2. Performance of the graph-based method using Kmax and the dk-2 distance metric on Monday, Week 5 traffic.

Total instances  Attacking instances  DR     FPR     CR
1320             122                  100 %  1.25 %  98.86 %

Table 3. Number of attack instances detected for each attack type on Monday, Week 5 traffic.

Attack type      Attack instances  Detected instances
apache2-dos      30                30
arppoison-probe  15                15
dict-r2l         17                17
guesstelnet-r2l  4                 4
ipsweep-probe    26                26
ls-probe         2                 2
neptune-dos      5                 5
portsweep-probe  4                 4
smurf-dos        2                 2
udpstorm-dos     16                16
crashiis-dos     1                 1

Validation (POSTECH July, 2009)
- POSTECH traces, July 2009

Date   DDoS attack  Trace size
03/31  No           30.7 GB
07/08  Yes          27.3 GB

Validation (POSTECH July, 2009)
- POSTECH traces, July 2009
Figure 15. Kmax over time for POSTECH's trace of July 8th, 2009.
Figure 16. dk-2 distance over time for POSTECH's trace of July 8th, 2009.

Validation (POSTECH July, 2009)
- POSTECH traces, July 2009
[Figure panels: POSTECH normal trace in 2009; POSTECH DDoS trace in 2009.7.9]

Validation (Honeynet dataset)
- Real P2P botnet (Peacomm) traffic trace
  - We executed Trojan Peacomm binaries in a honeynet consisting of 12 hosts.
- Synthesized traffic dataset
  - We injected the P2P botnet (Peacomm) trace into a normal POSTECH traffic trace.

Validation (Honeynet dataset)
- Results
[Figure: dk-2 matrices, panels Normal and Anomaly]

Validation (Real-time anomaly detection)
- The real-time anomaly detection system
Figure 22. Real-time Anomaly Detection System: function diagram.
Figure 23. Real-time Anomaly Detection System: user interface.

Validation (Real-time anomaly detection)
- Real-time anomaly detection system testing
  - We launched a port scanning attack from a host in our campus dormitory network against a host outside the campus network.
  - A TCP port scanning tool was used to generate 100 port scanning instances.
  - Result: DR = 100% and FP = 0.
Figure 24. dk-2 distance and Kmax during the TCP port scanning attacks.

Conclusion & Future Work
- Conclusion
  - Provides a new approach to anomaly detection.
  - Improves the performance of state-of-the-art techniques.
  - Implements a real-time anomaly detection system based on the proposed method.
  - Offers a new way to analyze network traffic for anomaly detection, with clear visualization.
- Future work
  - Develop a classifier that determines the thresholds automatically and statistically.
  - Validate the approach with other traces.
  - Combine our metrics with other effective metrics to increase the accuracy of anomaly detection and attack identification.

References
1. Y. Zhou, G. Hu and W. He, "Using graph to detect network traffic anomaly," Conference on Communications, Circuits and Systems, 2009.
2. A. Godiyal, M. Garland and C.H. John, "Enhancing network traffic visualization by graph pattern analysis," 2011.
3. M. Iliofotou, P. Pappu, M. Faloutsos, M. Mitzenmacher, G. Varghese, and H. Kim, "Graption: Automated detection of P2P applications using traffic dispersion graphs (TDGs)," Tech. Rep. UCR-CS-2008-06080, Department of Computer Science and Engineering, University of California, Riverside, June 2008.
4. S. Voss and J. Subhlok, "Performance of general graph isomorphism algorithms," Technical Report UH-CS-09-07, University of Houston, 2010.
5. J.W. Hong, "Internet traffic monitoring and analysis using NG-MON," POSTECH, Advanced Communication Technology, The 6th International Conference, vol. 1, pp. 100-120, 2004.
6. D. Whitney, "Basic Network Metrics," lecture note, 2008.
7. M. Iliofotou, M. Faloutsos and M. Mitzenmacher, "Exploiting dynamicity in graph-based traffic analysis: techniques and applications," in Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies (CoNEXT '09), ACM, New York, NY, USA, 2009, pp. 241-252.
8. T.-F. Yen and M. K. Reiter, "Are your hosts trading or plotting? Telling P2P file-sharing and bots apart," in 30th International Conference on Distributed Computing Systems, 2010.
9. D. Q. Le, T. Jeong, H. E. Roman, and J.W. Hong, "Traffic Dispersion Graph Based Anomaly Detection," in Proc. of the Second Symposium on Information and Communication Technology (SoICT), Hanoi, Vietnam, Oct. 13-14, 2011, pp. 36-41.

Q & A
Thank you

Appendix

Comparison

Table 2. Performance of the graph-based method using Kmax and the dk-2 distance metric on Monday, Week 5 traffic.

Method                Total instances  Attacking instances  DR     FPR      CR
Proposed method       1320             122                  100 %  1.25 %   98.86 %
Wavelet-based method  1320             122                  99 %   56.97 %  53.30 %

Appendix (VF2 Algorithm)
Source: P. Foggia

VF2 Algorithm
- Considering two graphs Q and G, a (sub)graph isomorphism from Q to G is expressed as a set of pairs $(n, m)$ (with $n \in G_1$, $m \in G_2$).
[Figure: graph Q with vertices 1 (A), 2 (B), 3 (C) and graph G with vertices 1 (A), 2 (B), 3 (C), 4 (A)]
- Two mappings shown in the figure:
  S1 = {(1, 1), (2, 2), (3, 3)}
  S2 = {(1, 4), (2, 2), (3, 3)}

VF2 Algorithm
- Idea: finding the (sub)graph isomorphism between Q and G is a sequence of state transitions.
- How can the candidate pair sets for an intermediate state be found?
[Figure: graphs Q and G]
- Intermediate states:
  s1: (2,2)
  s2: (2,2), (1,1)
  s3: (2,2), (1,1), (3,3)

VF2 Algorithm
- Let s be an intermediate state. s denotes a partial mapping from Q to G, namely a mapping from a subgraph of Q to a subgraph of G.
  - These two subgraphs are denoted Q(s) and G(s).
- All vertices neighboring Q(s) in graph Q are denoted $N_Q(s)$, and all vertices neighboring G(s) in graph G are denoted $N_G(s)$.
- The candidate pair sets are a subset of $N_Q(s) \times N_G(s)$.
- Assume a pair $(n, m) \in N_Q(s) \times N_G(s)$.
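
The candidate-pair idea above can be illustrated with a small sketch. It covers only the candidate generation step, not the full VF2 feasibility rules, and it assumes undirected graphs stored as adjacency dictionaries with one label per vertex. Filtering candidates by equal labels is an assumption of this sketch, as are the edges of the two example graphs, which are hypothetical.

```python
def candidate_pairs(adj_q, adj_g, labels_q, labels_g, mapping):
    """Candidate pairs for a partial mapping s (dict: Q vertex -> G vertex).

    N_Q(s) and N_G(s) are the unmapped neighbours of the mapped vertices;
    the result is the subset of N_Q(s) x N_G(s) whose labels agree.
    """
    mapped_q = set(mapping)
    mapped_g = set(mapping.values())
    n_q = {v for u in mapped_q for v in adj_q[u]} - mapped_q
    n_g = {v for u in mapped_g for v in adj_g[u]} - mapped_g
    return sorted((n, m) for n in n_q for m in n_g
                  if labels_q[n] == labels_g[m])


# Hypothetical graphs: Q is a path 1-2-3, G is a star centred on vertex 2.
adj_q = {1: {2}, 2: {1, 3}, 3: {2}}
labels_q = {1: "A", 2: "B", 3: "C"}
adj_g = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
labels_g = {1: "A", 2: "B", 3: "C", 4: "A"}

# Intermediate state s = {(2, 2)}.
print(candidate_pairs(adj_q, adj_g, labels_q, labels_g, {2: 2}))
# [(1, 1), (1, 4), (3, 3)]
```

Restricting candidates to neighbours of the already matched subgraphs keeps the search from trying every vertex pair at each state.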

VF2 Algorithm
[Figure: graphs Q and G with state (2, 2) and its candidate pair sets (1, 1), (1, 4), (3, 3), (3, 3)]

VF2 Algorithm

Drawing TDG
- Drawing a network traffic graph?
[Figure: Generate, Visualize]

Figure 4: DDoS Attack Taxonomy
- DDoS Attack
  - Bandwidth Depletion
    - Flood Attack: UDP (Random Port Attack, Same Port Attack), ICMP
    - Amplification Attack: Smurf Attack, Fraggle Attack
  - Resource Depletion
    - Protocol Exploit Attack: TCP SYN Attack, PUSH + ACK Attack
    - Malformed Packet Attack: IP Address Attack, IP Packet Options Attack
- Further nodes in the figure: Spoof Source IP Address?, Direct Attack, Loop Attack

Attack Templates
[Figure panels: Pattern Specification; DDoS Pattern]

Attack Templates (1/3)

Attack Templates (2/3)

Attack Templates (3/3)

Thresholds of POSTECH network

Protocol  Kmax   dk-2 distance
TCP       5525   11328
UDP       15327  23608
ICMP      1425   2996

NG-MON2

NAT

Validation (DARPA dataset)
- DARPA 1999 dataset
  - Weeks 1 and 3: no attacks (training data).
  - Week 2: 43 attacks belonging to 18 labeled attack types, used for system development.
  - Weeks 4 and 5: 201 attacks belonging to 58 attack types (including 40 new attacks).
- Traffic data for Monday, Week 5 of the DARPA dataset
  - Includes 122 attack instances.
  - Attacks that change the communication structure of the network graph: Smurf, apache2, udpstorm, portsweep, etc.

Validation
- We use standard measures, namely detection rate (DR), false positive rate (FPR), and overall classification rate (CR), to evaluate our approach.
  - True Positive (TP): the number of anomalous instances that are correctly identified.
  - True Negative (TN): the number of legitimate instances that are correctly classified.
  - False Positive (FP): the number of legitimate instances that are incorrectly identified as anomalies.
  - False Negative (FN): the number of anomalous instances that are incorrectly classified as legitimate.
- DR = TP / (TP + FN)
- FPR = FP / (TN + FP)
- CR = (TP + TN) / (TP + TN + FP + FN)
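
The three formulas above can be checked with a short calculation. In the sketch below, TP = 122 and FN = 0 follow from the reported 122 attacking instances and 100 % detection rate on Monday, Week 5. FP = 15 and TN = 1183 are not stated in the slides; they are back-calculated here as the counts that would reproduce the reported FPR and CR out of 1320 instances, and should be read as an assumption. The function name is hypothetical.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (DR, FPR, CR) as fractions from confusion-matrix counts."""
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    cr = (tp + tn) / total if total else 0.0
    return dr, fpr, cr


# TP and FN from the reported results; FP and TN are assumed (see above).
dr, fpr, cr = detection_metrics(tp=122, tn=1183, fp=15, fn=0)
print(f"{dr:.2%} {fpr:.2%} {cr:.2%}")  # 100.00% 1.25% 98.86%
```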

Peacomm
- Connect to Overnet: the bot publishes itself on the Overnet network and connects to peers. The initial list of peers is hard-coded in the bot.
- Download secondary injection URL: the bot uses hard-coded keys to search for and download a value on the Overnet network. The value is an encrypted URL that points to the location of a secondary injection executable.
- Decrypt secondary injection URL: the bot uses a hard-coded key to decrypt the downloaded value, which is a URL.
- Download secondary injection: the bot downloads the secondary injection from a web server using the decrypted URL.
- Execute secondary injection: the bot executes the secondary injection, possibly scheduling future upgrades on the peer-to-peer network or scheduling bot stat tracking at some other resource.
Source: http://static.usenix.org/event/hotbots07/tech/full_papers/grizzard/grizzard_html/

Peacomm
Figure 2: Number of Remote IPv4 Addresses Contacted Over Time for Duration of Infection


Graph Metrics on TDGs
- dk-2 distance
  - Structure analysis with the dk-n series (n = 1, 2, 3, ...).
  - Looks at inter-dependencies among topology characteristics.
  - The dk-n series are degree correlations within simple connected graphs of size n.
Source: Ben Zhao (June 22, 2011)

P2P (1st generation)

Gnutella (2nd generation)

KaZaA (3rd generation)

KaZaA

Distributed Hash Tables (4th generation)

dk-2 value matrix
[Figure panels: Normal; Anomaly]