Analysis of FRAUD network ACTIONS; rules and models for detecting fraud activities. Eren Golge

Similar documents
Analysis of Feature Selection Techniques: A Data Mining Approach

Classification Trees with Logistic Regression Functions for Network Based Intrusion Detection System

CHAPTER V KDD CUP 99 DATASET. With the widespread use of computer networks, the number of attacks has grown

Network attack analysis via k-means clustering

CHAPTER 5 CONTRIBUTORY ANALYSIS OF NSL-KDD CUP DATA SET

A Technique by using Neuro-Fuzzy Inference System for Intrusion Detection and Forensics

Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction

INTRUSION DETECTION SYSTEM

Detection of DDoS Attack on the Client Side Using Support Vector Machine

FUZZY KERNEL C-MEANS ALGORITHM FOR INTRUSION DETECTION SYSTEMS

Big Data Analytics: Feature Selection and Machine Learning for Intrusion Detection On Microsoft Azure Platform

CHAPTER 4 DATA PREPROCESSING AND FEATURE SELECTION

A Hybrid Anomaly Detection Model using G-LDA

A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms

Feature Reduction for Intrusion Detection Using Linear Discriminant Analysis

Data Reduction and Ensemble Classifiers in Intrusion Detection

Machine Learning for Network Intrusion Detection

System Health Monitoring and Reactive Measures Activation

Anomaly detection using machine learning techniques. A comparison of classification algorithms

NAVAL POSTGRADUATE SCHOOL THESIS

International Journal of Scientific & Engineering Research, Volume 6, Issue 6, June ISSN

Association Rule Mining in Big Data using MapReduce Approach in Hadoop

CHAPTER 2 DARPA KDDCUP99 DATASET

An Efficient Decision Tree Model for Classification of Attacks with Feature Selection

Classification of Attacks in Data Mining

Classifying Network Intrusions: A Comparison of Data Mining Methods

PAYLOAD BASED INTERNET WORM DETECTION USING NEURAL NETWORK CLASSIFIER

Journal of Asian Scientific Research EFFICIENCY OF SVM AND PCA TO ENHANCE INTRUSION DETECTION SYSTEM. Soukaena Hassan Hashem

A COMPARATIVE STUDY OF CLASSIFICATION MODELS FOR DETECTION IN IP NETWORKS INTRUSIONS

A hybrid network intrusion detection framework based on random forests and weighted k-means

Adaptive Framework for Network Intrusion Detection by Using Genetic-Based Machine Learning Algorithm

Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Data Set

DATA REDUCTION TECHNIQUES TO ANALYZE NSL-KDD DATASET

ATwo Stage Intrusion Detection Intelligent System

Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection

Fuzzy Grids-Based Intrusion Detection in Neural Networks

Analysis of network traffic features for anomaly detection

The Caspian Sea Journal ISSN: A Study on Improvement of Intrusion Detection Systems in Computer Networks via GNMF Method

Signature Based Intrusion Detection using Latent Semantic Analysis

A Back Propagation Neural Network Intrusion Detection System Based on KVM

Independent degree project - first cycle Bachelor s thesis 15 ECTS credits

Network Anomaly Detection using Co-clustering

Feature Selection in UNSW-NB15 and KDDCUP 99 datasets

LAWRA: a layered wrapper feature selection approach for network attack detection

Experiments with Applying Artificial Immune System in Network Attack Detection

Network Forensics Method Based on Evidence Graph and Vulnerability Reasoning

An Intelligent CRF Based Feature Selection for Effective Intrusion Detection

Anomaly based Network Intrusion Detection using Machine Learning Techniques.

Modeling An Intrusion Detection System Using Data Mining And Genetic Algorithms Based On Fuzzy Logic

A COMPARATIVE STUDY OF DATA MINING ALGORITHMS FOR NETWORK INTRUSION DETECTION IN THE PRESENCE OF POOR QUALITY DATA (complete-paper)

Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing

Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets

Contribution of Four Class Labeled Attributes of KDD Dataset on Detection and False Alarm Rate for Intrusion Detection System

Novel Technique of Extraction of Principal Situational Factors for NSSA

Modeling Intrusion Detection Systems With Machine Learning And Selected Attributes

Rough Set Discretization: Equal Frequency Binning, Entropy/MDL and Semi Naives algorithms of Intrusion Detection System

Cloud Computing Intrusion Detection Using Artificial Bee Colony-BP Network Algorithm

A Combined Anomaly Base Intrusion Detection Using Memetic Algorithm and Bayesian Networks

Intrusion Detection System with FGA and MLP Algorithm

Visualization of anomaly detection using prediction sensitivity

Journal of Babylon University/Pure and Applied Sciences/ No.(8)/ Vol.(21): 2013

ARTIFICIAL INTELLIGENCE APPROACHES FOR INTRUSION DETECTION.

Data Mining Approaches for Network Intrusion Detection: from Dimensionality Reduction to Misuse and Anomaly Detection

Autonomic GIG Management & Security Agent Technology

Toward Building Lightweight Intrusion Detection System Through Modified RMHC and SVM

A FEATURE SELECTION APPROACH USING BINARY FIREFLY ALGORITHM FOR NETWORK INTRUSION DETECTION SYSTEM

Collaborative Security Attack Detection in Software-Defined Vehicular Networks

An Active Rule Approach for Network Intrusion Detection with Enhanced C4.5 Algorithm

NETWORK INTRUSION DETECTION SYSTEMS USING RANDOM FORESTS ALGORITHM

Extracting salient features for network intrusion detection using machine learning methods

Bayesian Learning Networks Approach to Cybercrime Detection

ScienceDirect. Analysis of KDD Dataset Attributes - Class wise For Intrusion Detection

Anomaly Intrusion Detection System Using Hierarchical Gaussian Mixture Model

A Neuro-Fuzzy Classifier for Intrusion Detection Systems

A Comparative Study of Hidden Markov Model and Support Vector Machine in Anomaly Intrusion Detection

Performance Analysis of Big Data Intrusion Detection System over Random Forest Algorithm

Intrusion Detection System using Cascade Forward Neural Network with Genetic Algorithm Based Feature Selection

Improving the Attack Detection Rate in Network Intrusion Detection using Adaboost Algorithm

On Dataset Biases in a Learning System with Minimum A Priori Information for Intrusion Detection

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Detection of Network Intrusions with PCA and Probabilistic SOM

Performance Analysis of various classifiers using Benchmark Datasets in Weka tools

PROJECT 1 DATA ANALYSIS (KR-VS-KP)

7 Techniques for Data Dimensionality Reduction

Analysis of neural networks usage for detection of a new attack in IDS

Feature Selection in the Corrected KDD -dataset

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Intrusion Detection System based on Support Vector Machine and BN-KDD Data Set

Intrusion Detection System Using Hybrid Approach by MLP and K-Means Clustering

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

From Feature Selection to Building of Bayesian Classifiers: A Network Intrusion Detection Perspective

A Feature Selection for Intrusion Detection System Using a Hybrid Efficient Model

A Comparative Study of Supervised and Unsupervised Learning Schemes for Intrusion Detection. NIS Research Group Reza Sadoddin, Farnaz Gharibian, and

REVIEW OF VARIOUS INTRUSION DETECTION METHODS FOR TRAINING DATA SETS

An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

DIMENSIONALITY REDUCTION FOR DENIAL OF SERVICE DETECTION PROBLEMS USING RBFNN OUTPUT SENSITIVITY

Abnormal Network Traffic Detection Based on Semi-Supervised Machine Learning

Evaluating the Strengths and Weaknesses of Mining Audit Data for Automated Models for Intrusion Detection in Tcpdump and Basic Security Module Data

Optimized Intrusion Detection by CACC Discretization Via Naïve Bayes and K-Means Clustering

Transcription:

Analysis of FRAUD network ACTIONS; rules and models for detecting fraud activities Eren Golge

FRAUD? HACKERS!! DoS: Denial of service R2L: Unauth. Access U2R: Root access to Local Machine. Probing: Survallience.... First PC Viruses: Prophets of PC Viruses Our Case : Fraud = Anomally

EXPECTATIONs? concise data? (Feature Weightining and Selection) model for classification? (Model Extration) what indicates fraud? (Rule extraction)

Constraints DATA HUGE UNBALANCED HIGH DIMESNIONAL NOMINAL+NUMERICAL FEATs DIVERGENT CASES MACHINE MEMORY TIME

DATA KDD'99 Data TCP dump on LAN -> Lincoln Labs-MIT ** Peppered with attacks. 4 million records (reduction need) 42 Features (selection need) 16 Attact Type (merge as anomaly) Deficient for detecting content based attackts.* * Kayacık, H. G., Zincir-heywood, A. N., & Heywood, M. I. (n.d.). Selecting Features for Intrusion Detection : A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets, 3 8. ** http://www.ll.mit.edu/

Instances Instance 2 sec snapshots of connection. Connection TCP packages between same two IPs Each instance ~ 80 bytes

43 features Categories Host Service (Telnet, Voip) Content Exp. Root access, Shell activation TCP stats Destination host 4 nominal vs 39 numeric features

TCP stats

Content Features

Same Host & Same Service

Destination Host @attribute 'dst_host_count' @attribute 'dst_host_srv_count' @attribute 'dst_host_same_srv_rate' @attribute 'dst_host_diff_srv_rate' @attribute 'dst_host_same_src_port_rate' @attribute 'dst_host_srv_diff_host_rate' @attribute 'dst_host_serror_rate' @attribute 'dst_host_srv_serror_rate' @attribute 'dst_host_rerror_rate' @attribute 'dst_host_srv_rerror_rate' real real real real real real real real real real

All List duration: continuous. protocol_type: symbolic. service: symbolic. flag: symbolic. src_bytes: continuous. dst_bytes: continuous. land: symbolic. wrong_fragment: continuous. urgent: continuous. hot: continuous. num_failed_logins: continuous. logged_in: symbolic. num_compromised: continuous. root_shell: continuous. su_attempted: continuous. num_root: continuous. num_file_creations: continuous. num_shells: continuous. num_access_files: continuous. num_outbound_cmds: continuous. is_host_login: symbolic. is_guest_login: symbolic. count: continuous. srv_count: continuous. serror_rate: continuous. srv_serror_rate: continuous. rerror_rate: continuous. srv_rerror_rate: continuous. same_srv_rate: continuous. diff_srv_rate: continuous. srv_diff_host_rate: continuous. dst_host_count: continuous. dst_host_srv_count: continuous. dst_host_same_srv_rate: continuous. dst_host_diff_srv_rate: continuous. dst_host_same_src_port_rate: continuous. dst_host_srv_diff_host_rate: continuous. dst_host_serror_rate: continuous. dst_host_srv_serror_rate: continuous. dst_host_rerror_rate: continuous. dst_host_srv_rerror_rate: continuous. class : symbolic 4 symbolic (nominal) + 39 numeric

Path to Gotcha Moment Step1: Preprocessing Packages used: Rapidminer Knime Weka R Best performance on slide General Work Iteration Run the code Go out, waste time Come back Normalization Scaling Class merging Data Filtering Step2: Feat. Selection Genetic Algorithm (Out of curiosity) Backward Selection Supervision Step3: Model Learning Xval. Parameter optimizing SVM Neural Net. Naive Bayes Decision Trees Step4: Rule Learning Decision Trees Naive Bayes Step5: Eval. Of Anomaly detection Get Useless Results by LOF (not included) Compare with supervised method

Step1: Clear Data Remove redundant data 4,898,431 -> 1,074,991 (742MB -> 18.7MB) Attack : 3,925,650 -> 262,178 (cause deficinecy?) Normal : 972,781 -> 812,814 Reduce majority effect Random Selection 1,074,991 -> 125,973 (57,666 Ano. Vs 67,388 - Normal) Keep amount of fraud types same Decrease payload of Normal instances (cost of Anomaly is bigger) Merge class values FraudActivities (dos,probe) -> anomaly

Step2: Future Selection Genetic Algorithms vs Backward Selection + ROC curves Correlation Matrix

Step2: Feature Selection Genetic algorithm : 42 feats => 12 feats NaiveBayes( ) { returns fittness; } PerformanceVector: accuracy: 84.67% ConfusionMatrix: True: anomaly normal anomaly:10912 1536 normal: 1921 8175 Without feature selection PerformanceVector: accuracy: 77.58% ConfusionMatrix: True: anomaly normal anomaly:8540 761 normal: 4293 8950

Step2 Backward elemination with Naive Bayes PerformanceVector: accuracy: 89.46% ConfusionMatrix: True: anomaly normal anomaly: 12257 1800 normal: 576 7911 BETTER but LONGER! 42 feats => 30 feats IMPROVEMENT! Without feature selection GENETIC ALGO. İs good if you have time constraint! %84 Accuracy PerformanceVector: accuracy: 77.58% ConfusionMatrix: True: anomaly normal anomaly:8540 761 normal: 4293 8950

Step2: After unsupervised selection 42 features => 31 features SELECTED FEATURES protocol_type service flag land urgent hot num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_guest_login srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate

Step2 with my hands Support of unsupervised selection What are my constraints Correlation Correlation = same information ROC curves

ROC curves Straight Lines = Uninformative

Correlations over 0.85 Correlation Matrix

After ROC and Correlation 42 to 35 features Selected features @ATTRIBUTE duration REAL @ATTRIBUTE protocol_type REAL @ATTRIBUTE service REAL @ATTRIBUTE flag REAL @ATTRIBUTE src_bytes REAL @ATTRIBUTE dst_bytes REAL @ATTRIBUTE land REAL @ATTRIBUTE wrong_fragment REAL @ATTRIBUTE urgent REAL @ATTRIBUTE hotreal @ATTRIBUTE num_failed_logins REAL @ATTRIBUTE logged_in REAL @ATTRIBUTE num_compromised REAL @ATTRIBUTE root_shell REAL @ATTRIBUTE su_attempted REAL @ATTRIBUTE num_file_creations REAL @ATTRIBUTE num_shells REAL @ATTRIBUTE num_access_files REAL @ATTRIBUTE num_outbound_cmdsreal @ATTRIBUTE is_host_login REAL @ATTRIBUTE is_guest_login REAL @ATTRIBUTE count REAL @ATTRIBUTE srv_count REAL @ATTRIBUTE rerror_rate REAL @ATTRIBUTE same_srv_rate REAL @ATTRIBUTE diff_srv_rate REAL @ATTRIBUTE srv_diff_host_rate REAL @ATTRIBUTE dst_host_count REAL @ATTRIBUTE dst_host_diff_srv_rate REAL @ATTRIBUTE dst_host_same_src_port_rate REAL @ATTRIBUTE dst_host_srv_diff_host_ratereal

Naive Bayes after Supervised Filtering === Cross-Val Results === Correctly Classified Instances 117594 93.3486 % Incorrectly Classified Instances 8379 6.6514 % a b <-- classified as 64341 3002 a = normal 5377 53253 b = anomaly Clues of divergency of test datas! Overfitting! === Test Set Results === Correctly Classified Instances 16811 74.5697 % Incorrectly Classified Instances 5733 25.4303 % a b <-- classified as 9040 671 a = normal 5062 7771 b = anomaly

Final Atributes 42 to 29 features Backward selec: -11 ROC+Corre. Mat: -2 Selected features protocol_type service flag land urgent hot num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_guest_login srv_count serror_rate srv_rerror_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate

Step3: Model Learning Random Forests Neural Nets SVM with RBF Naive Bayes

Step3: Model Learning Random Forest Enhanced to overfitting No X-Val requirement === Summary === Correctly Classified Instances 18503 82.0751 % Incorrectly Classified Instances 4041 17.9249 % === Confusion Matrix === a b <-- classified as 9022 689 a = normal 3352 9481 b = anomal Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd international conference on Machine learning - ICML 06, 161 168. doi:10.1145/1143844.1143865

Step3: Model Learning Neural Nets 10% of train data (# classes+# attributes)/2 = # hidden units Optimized Parameters Momentum = 0.1 Learning rate = 0.2 Decay Long, so long to train... Accuracy: 84.08%

Step3: Model Learning SVM libsvm with RBF kernel Numeratize nominal values. Long training time PerformanceVector: 75.78% accuracy: ConfusionMatrix: True: anomaly normal anomaly:10342 2969 normal: 2491 6742 Hsu, C., Chang, C., & Lin, C. (2003). A practical guide to support vector classification, 1(1), 1 16. Retrieved from https://www.cs.sfu.ca/people/faculty/teaching/726/spring11/svmguide.pdf

Step3 Naive Bayes Probabilistic approach Analitical results Also gives probabilistic rules GOTCHA! PerformanceVector: accuracy: 89.46% ConfusionMatrix: True: anomaly normal anomaly:12257 1800 normal: 576 7911

Step4: Rule Induction Decision Tree + Naive Bayes

Step4: Backward Selection + Grid Search+Decision Tree => 42 to 32 feats 85.50% accuracy

Tree Over Overview

is_host_login = 0 flag = OTH protocol_type = tcp flag = REJ protocol_type = tcp land = 0 flag = RSTOS0 protocol_type = tcp land = 0\ flag = RSTOS0: anomaly {normal=0, anomaly=10} flag = RSTR protocol_type = tcp land = 0 flag = S0 protocol_type = tcp logged_in = 0 flag = S1 protocol_type = tcp land = 0 logged_in = 0 logged_in = 1: normal {normal=31, anomaly=0} flag = S2 protocol_type = tcp land = 0 is_guest_login = 0 flag = S3 protocol_type = tcp land = 0 logged_in = 1 flag = SF land = 0 protocol_type = icmp logged_in = 0 protocol_type = tcp hot > 1.500 hot 1.500 protocol_type = udp logged_in = 0 flag = SH: anomaly {normal=0, anomaly=28} Top Nodes Most discriminative attributes is_host_login...at that particular snapshot flag Package header flag protocol_type Protocol of connetion

Step4: Naive Bayes Accuracy 89.43% Exampe Results PROTOCOL EFFECT

Step4: Naive Bayes Accuracy 89.43% # COMPROMISE CONDITIONS İf error packages go up, chance of Fraud incrases

Step4: Naive Bayes Accuracy 89.43% FLAGS If so much Sync0 flag, increased prob of fraud

Gotchas Naive Bayes Poor Machine + Less Effort + Naive Bayes = Best Model Best indicators of fraud Protocols (monitoring icmp packages) Log compromised conditions (larger # is secure) Log SNY packages.

What could or will be done? Test selected features with other algorithms as well Craft a new algorithm for divergency Wire a software and see the real time perofrmance. Run a accociation algorithm.

ANY QUESTION & COMMENT & SAYING & WORD & SOMETHING?