Network attack analysis via k-means clustering

Similar documents
A Technique by using Neuro-Fuzzy Inference System for Intrusion Detection and Forensics

INTRUSION DETECTION SYSTEM

CHAPTER V KDD CUP 99 DATASET. With the widespread use of computer networks, the number of attacks has grown

A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms

CHAPTER 4 DATA PREPROCESSING AND FEATURE SELECTION

Analysis of FRAUD network ACTIONS; rules and models for detecting fraud activities. Eren Golge

FUZZY KERNEL C-MEANS ALGORITHM FOR INTRUSION DETECTION SYSTEMS

Classification of Attacks in Data Mining

Analysis of Feature Selection Techniques: A Data Mining Approach

Big Data Analytics: Feature Selection and Machine Learning for Intrusion Detection On Microsoft Azure Platform

CHAPTER 2 DARPA KDDCUP99 DATASET

Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction

Independent degree project - first cycle Bachelor s thesis 15 ECTS credits

CHAPTER 5 CONTRIBUTORY ANALYSIS OF NSL-KDD CUP DATA SET

Classification Trees with Logistic Regression Functions for Network Based Intrusion Detection System

Classifying Network Intrusions: A Comparison of Data Mining Methods

Feature Reduction for Intrusion Detection Using Linear Discriminant Analysis

Anomaly detection using machine learning techniques. A comparison of classification algorithms

Data Mining Approaches for Network Intrusion Detection: from Dimensionality Reduction to Misuse and Anomaly Detection

NAVAL POSTGRADUATE SCHOOL THESIS

Unsupervised clustering approach for network anomaly detection

International Journal of Scientific & Engineering Research, Volume 6, Issue 6, June ISSN

A COMPARATIVE STUDY OF CLASSIFICATION MODELS FOR DETECTION IN IP NETWORKS INTRUSIONS

CHAPTER 7 Normalization of Dataset

The Caspian Sea Journal ISSN: A Study on Improvement of Intrusion Detection Systems in Computer Networks via GNMF Method

Association Rule Mining in Big Data using MapReduce Approach in Hadoop

A Hybrid Anomaly Detection Model using G-LDA

Data Reduction and Ensemble Classifiers in Intrusion Detection

Machine Learning for Network Intrusion Detection

Detection of DDoS Attack on the Client Side Using Support Vector Machine

Discriminant Analysis based Feature Selection in KDD Intrusion Dataset

ATwo Stage Intrusion Detection Intelligent System

System Health Monitoring and Reactive Measures Activation

A COMPARATIVE STUDY OF DATA MINING ALGORITHMS FOR NETWORK INTRUSION DETECTION IN THE PRESENCE OF POOR QUALITY DATA (complete-paper)

Towards A New Architecture of Detecting Networks Intrusion Based on Neural Network

Experiments with Applying Artificial Immune System in Network Attack Detection

A Hierarchical SOM based Intrusion Detection System

Intrusion Detection Based On Clustering Algorithm

On Dataset Biases in a Learning System with Minimum A Priori Information for Intrusion Detection

A hybrid network intrusion detection framework based on random forests and weighted k-means

Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Data Set

An Efficient Decision Tree Model for Classification of Attacks with Feature Selection

Towards an Efficient Anomaly-Based Intrusion Detection for Software-Defined Networks

Combination of Three Machine Learning Algorithms for Intrusion Detection Systems in Computer Networks

ARTIFICIAL INTELLIGENCE APPROACHES FOR INTRUSION DETECTION.

PAYLOAD BASED INTERNET WORM DETECTION USING NEURAL NETWORK CLASSIFIER

DATA REDUCTION TECHNIQUES TO ANALYZE NSL-KDD DATASET

Ranking and Filtering the Selected Attributes for Intrusion Detection System

Journal of Asian Scientific Research EFFICIENCY OF SVM AND PCA TO ENHANCE INTRUSION DETECTION SYSTEM. Soukaena Hassan Hashem

Fuzzy Grids-Based Intrusion Detection in Neural Networks

Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection

Network Anomaly Detection using Co-clustering

Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets

An Intrusion Prediction Technique Based on Co-evolutionary Immune System for Network Security (CoCo-IDP)

Adaptive Framework for Network Intrusion Detection by Using Genetic-Based Machine Learning Algorithm

Signature Based Intrusion Detection using Latent Semantic Analysis

Performance Analysis of Big Data Intrusion Detection System over Random Forest Algorithm

INTRUSION DETECTION WITH TREE-BASED DATA MINING CLASSIFICATION TECHNIQUES BY USING KDD DATASET

Anomaly Intrusion Detection System Using Hierarchical Gaussian Mixture Model

Feature Selection in UNSW-NB15 and KDDCUP 99 datasets

Visualization of anomaly detection using prediction sensitivity

RUSMA MULYADI. Advisor: Dr. Daniel Zeng

Deep Feature Extraction for multi-class Intrusion Detection in Industrial Control Systems

A Back Propagation Neural Network Intrusion Detection System Based on KVM

LAWRA: a layered wrapper feature selection approach for network attack detection

This is a repository copy of Deep Learning Approach for Network Intrusion Detection in Software Defined Networking.

Learning Intrusion Detection: Supervised or Unsupervised?

INTRUSION DETECTION MODEL IN DATA MINING BASED ON ENSEMBLE APPROACH

Analysis of network traffic features for anomaly detection

Using MongoDB Databases for Training and Combining Intrusion Detection Datasets

Distributed Detection of Network Intrusions Based on a Parametric Model

Modeling An Intrusion Detection System Using Data Mining And Genetic Algorithms Based On Fuzzy Logic

A Rough Set Based Feature Selection on KDD CUP 99 Data Set

Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing

* This manuscript has been accepted for publication in IET Networks.

Rough Set Discretization: Equal Frequency Binning, Entropy/MDL and Semi Naives algorithms of Intrusion Detection System

An Active Rule Approach for Network Intrusion Detection with Enhanced C4.5 Algorithm

IDuFG: Introducing an Intrusion Detection using Hybrid Fuzzy Genetic Approach

Comparative Analysis of Classification Algorithms on KDD 99 Data Set

An Intelligent CRF Based Feature Selection for Effective Intrusion Detection

A Combined Anomaly Base Intrusion Detection Using Memetic Algorithm and Bayesian Networks

Mining Audit Data for Intrusion Detection Systems Using Support Vector Machines and Neural Networks

Anomaly based Network Intrusion Detection using Machine Learning Techniques.

Comparison of variable learning rate and Levenberg-Marquardt back-propagation training algorithms for detecting attacks in Intrusion Detection Systems

Anomaly Intrusion Detection System Using Information Theory, K-NN and KMC Algorithms

Intrusion detection system with decision tree and combine method algorithm

Network Forensics Method Based on Evidence Graph and Vulnerability Reasoning

Network Traffic Anomaly Detection Based on Packet Bytes ABSTRACT Bugs in the attack. Evasion. 1. INTRODUCTION User Behavior. 2.

Contribution of Four Class Labeled Attributes of KDD Dataset on Detection and False Alarm Rate for Intrusion Detection System

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) PROPOSED HYBRID-MULTISTAGES NIDS TECHNIQUES

Performance improvement of intrusion detection with fusion of multiple sensors

Intrusion Detection -- A 20 year practice. Outline. Till Peng Liu School of IST Penn State University

Intrusion Detection of Multiple Attack Classes using a Deep Neural Net Ensemble

A Discriminatory Model of Self and Nonself Network Traffic

Analysis of neural networks usage for detection of a new attack in IDS

Using Domain Knowledge to Facilitate Cyber Security Analysis

Hybrid Modular Approach for Anomaly Detection

Research Article A Universal High-Performance Correlation Analysis Detection Model and Algorithm for Network Intrusion Detection System

Journal of Babylon University/Pure and Applied Sciences/ No.(8)/ Vol.(21): 2013

Two Level Anomaly Detection Classifier

Transcription:

Network attack analysis via k-means clustering - By Team Cinderella Chandni Pakalapati cp6023@rit.edu Priyanka Samanta ps7723@rit.edu Dept. of Computer Science

CONTENTS Recap of project overview Analysis of Research Paper 1 Analysis of Research Paper 2 Analysis of Research Paper 3

PROJECT OVERVIEW : SUMMARY Problem Statement Detection of Network Intrusion The network is prone to the following kinds of attacks: DoS ( Denial of Services) Probing or Surveillance attacks User-to-root Remote-to-local Our approach K- Means Clustering Algorithm To find the hidden pattern in the data K-Means Clustering is a NP Hard problem Data to be used NSL-KDD Cup 1999 Dataset Link: https://kdd.ics.uci.edu/databases/kddcup99/kdd cup99.html Our goal To parallelize the K-Means algorithm

RESEARCH PAPER 1 Title: K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset Author: Vipin Kumar, Himadri Chauhan, Dheeraj Panwar Journal Name: International journal of soft computing and engineering ISSN: 2231-2307 Volume -3, Issue -4, Date: 09/01/2013 Page Number: 1-4 URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf

RESEARCH PAPER 1 : PROBLEM STATEMENT Detection of Network Intrusion by performing unsupervised data mining technique of clustering on unlabeled data. Analysis of the NSL-KDD dataset using K-Means Clustering. Clustering the dataset into normal and the following network attack types: DoS (Denial of Services) Probe R2L U2R

RESEARCH PAPER 1 : APPROACH TO THE PROBLEM K-Means is few of the simplest, unsupervised clustering algorithm that can cluster unlabeled data. Initialization of k prototypes (w 1,, w k ) [ k is the number of clusters ] [1] Each cluster C i is associated with prototype w i Repeat for each input data record d i, where i ϵ {1, n}, do Assign di to the cluster C j * with nearest prototype w j for each cluster C j, where j ϵ {1,, k}, do Update the prototype wj to be the centroid of all samples currently in C i, so that w i = d i ϵ C i / C i

RESEARCH PAPER 1 : CONTRIBUTION Understanding the multi dimension NSL-KDD dataset. Records from the dataset contains the following 41 attributes. [1] duration src_bytes dst_bytes land wrong_fragment urgent num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate protocol_type service flag hot num_file_creations srv_count srv_diff_host_rate dst_host_serror_rate class Class attributes specifies the class of the network traffic. It may be normal of on of the following 37 different kinds of attacks. Categorization of 37 different attack types into 4 attack categories. [1] DoS Probe R2L U2R Back, Land, Neptune, Pod, Smurf, teardrop, Mailbomb, Processtable, Udpstorm, Apache2, Worm Satan, Ipsweep, Nmap, Portsweep, Mscan, Saint Guess_Password, Ftp_write, Imap, Phf, Multihop, Warezmaster, Xlock, xsnoop, Snmpguess, Snmpgetattack, Httptunnel, Sendmail, Named Buffer_overflow, Loadmodule, Rootkit, Perl, Sqlattack, Xterm, Ps

RESEARCH PAPER 1 : DRAWBACKS It requires the entire dataset to be stored in the main memory. Involves high computational complexity. [ O(n*k*i*d) n=number of data records, k=number of clusters, i= number of iterations, d= dimensionality of the record ]

RESEARCH PAPER 1 : TAKEAWAY FOR OUR IMPLEMENTATION Knowing the attributes to work with. Understanding the types of attacks and their belongings to one of the 4 attack categories. Prototype to be considered is the centroid of the cluster. With the cluster size of 4, the optimal result can be achieved. Using Euclidean distance metric to compute the dissimilarity between the data records.

RESEARCH PAPER 2 Title: Parallelizing k-means Algorithm for 1-d Data Using MPI Author: Ilias K. Savvas and Georgia N. Sofianidou Journal Name: 2014 IEEE 23rd International WETICE Conference Date: 2014 Page Number: 179 184 URL: http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?tp=&arnumber=6927046

RESEARCH PAPER 2 : PROBLEM STATEMENT Reducing the computational complexity of the K-Means by implementing parallel K-means.

RESEARCH PAPER 2 : APPROACH TO THE PROBLEM It follows the Single Instruction Multiple Data (SIMD) paradigm. It works on the concept of master and worker nodes. Communication between the master and worker nodes are performed using MPI ( Message Passing Interface ).

RESEARCH PAPER 2 : CONTRIBUTION Parallel K-Means Algorithm [2] Master node, M finds the number of available worker nodes, assume : (N 1) M: splits input dataset D into D/(N 1) subsets M: transfers data to worker nodes for all Worker nodes w i, w i where i {1, 2, 3,...,N 1}, do in parallel Receive the data chunk from the master Perform K-Means clustering Send the local centroids and the number of data records assigned to each one of them to M end for M receives centroids and the corresponding data records assigned to each of the centroids. M sorts the centroid list M calculates global centroids by applying the weighted arithmetic mean and produce the final clustering.

RESEARCH PAPER 2 : CONTRIBUTION Parallel K-Means Algorithm [2] Master node, M finds the number of available worker nodes, assume: (N 1) M: splits input dataset D into D/(N 1) subsets M: transfers data to worker nodes for all Worker nodes w i, w i where i {1, 2, 3,...,N 1}, do in parallel Receive the data chunk from the master Perform K-Means clustering Send the local centroids and the number of data records assigned to each one of them to M end for M receives centroids and the corresponding data records assigned to each of the centroids. M sorts the centroid list M calculates global centroids by applying the weighted arithmetic mean and produce the final clustering. W1 W2 M W3 W4 C1 C2 C3 C4 Collecting local centroids M C 9/22/2015

RESEARCH PAPER 2 : UNDERSTANDING K-MEANS WITH AN EXAMPLE Input: Dataset with 40 data records, k=2 Creation of the chunks: Subset1 Subset2 Subset3 Subset4 12 13 39 20 3 7 19 21 30 31 32 26 8 2 16 27 18 4 17 28 15 25 22 29 23 10 38 34 24 9 33 35 5 14 40 36 6 1 11 37 Centroids with the number of data records assigned to each centroid Centroids Data records assigned 15.5 7 22.5 3 8.2 6 16.7 4 35.4 5 27 5 23.5 6 Sorting the centroids Centroids Data records assigned 8.2 6 15.5 7 16.7 4 22.5 3 23.5 6 27 5 35.4 5 36 4 36 4 Calculating global centroid C1 = 8.2*6 + 15.5*7 + 16.7*4 + 22.5*3 / (7+3+6+4) = 14.6 C2 = 23.5*6 + 27*5 + 35.4*5 + 36*4 / (5+5+6+4) = 29.9

RESEARCH PAPER 2 : RESULTS Parallel K-Means efficiency increases with an increasing number of clusters. Parallel K-Means is highly scalable. Fig 1: Increasing K and number of nodes [2] Fig 2: With K=2 increasing number of nodes [2]

RESEARCH PAPER 2 : DRAWBACK Message Passing overhead is involved in sending and receiving information across the master and worker nodes.

RESEARCH PAPER 2 : TAKEAWAY FOR OUR IMPLEMENTATION Improved performance of K-Means clustering by divide and merge mechanism (parallelizing). The results obtained are same to the results generated from the original K-Means clustering.

RESEARCH PAPER 3 Title: High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs Author: Ali Hadian, Saeed Shahrivari Journal Name: The Journal of Supercomputing ISSN: 0920-8542 Date: 08/2014 Page Number: 845 863 URL: http://link.springer.com.ezproxy.rit.edu/article/10.1007%2fs11227-014-1185-y

RESEARCH PAPER 3 : PROBLEM STATEMENT Implementation of parallel K-Means clustering utilizing the maximum capabilities of multiple cores of a computer.

RESEARCH PAPER 3 : APPROACH TO THE PROBLEM The algorithm takes a parallel processing approach utilizing the available cores in the CPU. Divide the dataset into multiple chunks. Cluster each of the chunks. Aggregate the chunk clustering and make the final cluster. This algorithm does only single pass over the dataset.

RESEARCH PAPER 3 : CONTRIBUTION Concepts of master thread and chunk-clustering thread are integral part of the algorithm. Input: K, chunk_size, dataset Algorithm: Master thread [3] Initialize chunks_queue [queues shared between threads] Initialize centroids[] [centroids shared between threads] while Not End_Of_Dataset do new_chunk load_next_chunk(data_set, chunk_size) enqueue(new_chunk, chunks_queue) if (chunks_queue is full) then wait(chunks_queue) [wait until a consumer thread dequeues a chunk] end if end while while chunks_queue is not empty do wait(chunks_queue) [wait for all chunks in queue to be clustered] end while C Cluster [using K-means++ clustering]

RESEARCH PAPER 3 : CONTRIBUTION (contd..) Concepts of master thread and chunk-clustering thread are integral part of the algorithm. Input: chunks_queue, centroids[], K(Number of clusters for each chunk) Algorithm: Chunk Clustering thread [3] while queue is not empty & main thread is alive do chunk_instances Dequeue(queue) C Cluster centroids[] centroids[] C if (queue is empty) then end if end while wait(queue) [wait until a consumer thread dequeues a chunk] With the clustered chunks, the master thread loads the centroid list. With the centroid list the master thread performs k-means++ clustering to find the final cluster

RESEARCH PAPER 3 : RESULTS The algorithm results in higher speedups when the data size is large enough. Fig 3: Speedup of the algorithm [3]

RESEARCH PAPER 3 : TAKEAWAY FOR OUR IMPLEMENTATION Utilizing multiple cores in the machine to cluster the dataset at the fastest.

REFERENCE [1] K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset Vipin Kumar, Himadri Chauhan, Dheeraj Panwar Journal Name: International journal of soft computing and engineering ISSN: 2231-2307 Volume -3, Issue -4, Date: 09/01/2013 Page Number: 1-4 URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf [2] Parallelizing k-means Algorithm for 1-d Data Using MPI Ilias K. Savvas and Georgia N. Sofianidou Journal Name: IEEE 23rd International WETICE Conference Date: 2014 Page Number: 179 184 URL: http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?tp=&arnumber=6927046 [3] High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs Ali Hadian, Saeed Shahrivari Journal Name: The Journal of Supercomputing ISSN: 0920-8542 Date: 08/2014 Page Number: 845 863 URL: http://link.springer.com.ezproxy.rit.edu/article/10.1007%2fs11227-014-1185-y