NETWORK FAULT DETECTION - A CASE FOR DATA MINING

Poonam Chaudhary & Vikram Singh
Department of Computer Science, Ch. Devi Lal University, Sirsa

ABSTRACT: This communication takes up parts of the general network fault management problem, namely fault detection, isolation, and diagnosis. A model is proposed that clusters network fault data using the k-mean algorithm and then classifies the clusters with the C4.5 algorithm; in parallel, the clustered data are used for training a neural network. The proposed model yields better classification of large data sets, fewer network faults, and enhanced accuracy.

Keywords: K-mean, self-organizing maps, data mining, J48, C4.5.

1. INTRODUCTION

Modern mobile communication networks are tremendously complex systems, usually capable of limited self-diagnosis. This self-diagnostic feature generates a huge amount of data in the form of status and alarm messages (Fawcett & Provost, 1997). Alarm signals may be false (so-called false positives) or true; accordingly, they need to be processed automatically in order to identify true network faults in a timely manner and avoid degradation of network performance. Such a proactive response is essential to maintaining the reliability of the network. The sheer volume of the data, and the fact that a single fault may cause seemingly unrelated cascading effects, make network fault isolation a quite difficult task. This is where data mining comes into the picture, generating rules for identifying and classifying network faults (Han et al., 2002).

1.1 Mobile network faults

A mobile network fault can be defined as an abnormal operation or defect at the component, equipment, or sub-system level that significantly degrades the performance of an active entity in the network or disrupts communication. Not every error is a fault, since protocols can handle most errors; faults, however, are usually indicated by an abnormally high error rate.
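The last point can be made concrete with a toy monitor that flags a link as suspect whenever the error rate over a sliding window of recent frames exceeds a threshold. This is purely illustrative: the window size, threshold, and function names below are hypothetical and not part of the proposed model.

```python
from collections import deque

def make_error_rate_monitor(window=100, threshold=0.05):
    """Return a callable that flags a fault when the error rate over the
    last `window` observations exceeds `threshold` (illustrative values)."""
    frames = deque(maxlen=window)  # 1 = frame in error, 0 = frame OK

    def observe(frame_in_error):
        frames.append(1 if frame_in_error else 0)
        error_rate = sum(frames) / len(frames)
        return error_rate > threshold

    return observe

# Feed a short stream of per-frame error indicators through the monitor.
observe = make_error_rate_monitor(window=10, threshold=0.2)
flags = [observe(err) for err in [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]]
```

Once the running error rate climbs above the 20% threshold, every subsequent frame in the window keeps the alarm raised until enough clean frames dilute it.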
More generally, a fault can be defined as the inability of an item to perform a required function (a set of processes defined for the purpose of achieving a specified objective), excluding inability due to preventive maintenance, lack of external resources, or planned actions.

Researchers and practitioners do not agree on a generally accepted description of what constitutes a mobile network fault as opposed to normal behaviour (Hajji et al., 2001; Hajji & Far, 2001). The majority of researchers, however, characterize network faults by 1) transient performance degradation, 2) high error rates, and 3) loss of service provision to customers (loss of signal, loss of connection, delays in service delivery and in obtaining connectivity). The main causes of network faults differ from network to network. Managing complex hardware and software systems has always been a difficult task; the Internet and the proliferation of web-based services have increased the importance of this task while aggravating the fault problem in at least four ways (Meira, 1997; Thottan & Ji, 1998; Hood & Ji, 1997; Lazar et al., 1992).

In the present work, a data mining tool has been designed and developed for network fault detection in mobile communication. The model has been implemented in a simulated environment on an Ericson data set containing the past history of network faults in mobile communication. Typical data mining techniques of classification, clustering, and training have been used to produce results, which have been analysed to determine parameter dependencies and various network anomalies.

2. PROPOSED MODEL

Figure 1. Proposed model showing blocks therein.

Figure 1 above describes the five-block model proposed in this research endeavour. The working of its building blocks is described in the following subsections.

2.1 Fault Detection Data

A data sheet from Ericson has been taken that records the various problems caused in the network and the reason each problem occurred. Although the data set is large, some entries have been replicated to enhance it further. The data set consists of the various alarms and their child alarms.

2.2 K-mean Clustering

The C4.5 algorithm is suited to classifying small data sets, which is why k-mean clustering is first used to arrange the data into clusters. K-mean clustering proceeds as follows:
1. Determine the centroid coordinates.
2. Determine the distance of each object from each centroid.
3. Group the objects by minimum distance (assign each object to its closest centroid).

2.3 Training using neural network

Training has been applied to each cluster: 20% of the data of each cluster has been trained using self-organizing maps.

2.4 C4.5 Classification

Classification is performed on each cluster using the C4.5 decision tree algorithm; attribute selection is based on the training data. The steps involved in C4.5 classification are:
1. Create a node N.
2. If the samples are all of the same class C, then return N as a leaf node labeled with class C.
3. If the attribute-list is empty, then return N as a leaf node labeled with the most common class in the samples (majority voting).
4. Select test-attribute, the attribute among the attribute-list, on the basis of the trained data.
5. Label node N with test-attribute.

6. For each known value ai of test-attribute, grow a branch from node N for the condition test-attribute = ai.
7. Let si be the partition, i.e. the set of samples for which test-attribute = ai.
8. If si is empty, attach a leaf labeled with the most common class in the samples; else attach the node returned by Generate_decision_tree(si, attribute-list minus test-attribute).

2.5 Data Analysis

The classification output has been analysed on the basis of the TP rate, FP rate, and accuracy (precision, recall, and ROC) performance metrics for estimating the accuracy of a given classification model.

2.5.1 Accuracy (precision and recall)

Overall percentage accuracy as a performance measure has been shown to be misleading. For example, a classifier that labels all instances as the majority class will achieve an accuracy of 94% when 94% of the instances belong to that class, even though it may have misclassified every minority-class instance because of the biased nature of the data set while still appearing accurate. It is therefore imperative to complement accuracy with an alternative method: precision and recall.

Precision = TP / (TP + FP)   (1)
Recall = TP / (TP + FN)   (2)

where TP, TN, FP, and FN are as represented in the confusion matrix. Precision in this context refers to the actual percentage of responses to mails that were predicted by the classification model, which translates into the return on the cost of mailing; recall, on the other hand, measures the percentage of customers who were identified and needed to be targeted.

2.5.2 Accuracy (ROC analysis)

A further method used to evaluate classifier performance is Receiver Operating Characteristic (ROC) analysis (Flach & Gamberger, 2001). It compares visually the performance of the classifier across the entire range of probabilities. It shows the trade-off

between the false-positive rate on the horizontal axis and the true-positive rate on the vertical axis. From the values in the confusion matrix above, the true-positive rate and false-positive rate are defined as equations (3) and (4) respectively:

TP rate = TP / (TP + FN)   (3)
FP rate = FP / (FP + TN)   (4)

As a standard method for evaluating classifiers, the primary advantage of ROC curves is that they evaluate the performance of a classifier independently of the naturally occurring class distribution or error cost.

3. RESULTS

The Ericson data set has been used to compare the performance of the proposed tool against the J48 and MLP tools available in the public domain. Tables 1, 2 and 3 list the performance of the three tools when run on the same test data.

Table 1: Parameter Analysis using J48
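The point metrics behind Tables 1-3 all derive from a single confusion matrix via equations (1)-(4). The sketch below uses illustrative counts, not the actual counts from the Ericson data set:

```python
def classifier_metrics(tp, fn, fp, tn):
    """Point metrics from a binary confusion matrix.

    Precision = TP / (TP + FP)   -- equation (1)
    Recall    = TP / (TP + FN)   -- equation (2)
    TP rate   = TP / (TP + FN)   -- equation (3), identical to recall
    FP rate   = FP / (FP + TN)   -- equation (4)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "tp_rate": recall, "fp_rate": fp_rate, "f_measure": f_measure}

# Illustrative confusion-matrix counts only (not taken from the paper).
m = classifier_metrics(tp=30, fn=10, fp=20, tn=40)
```

Note that the TP rate and recall are the same quantity, which is why WEKA-style result tables report them in a single column.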

Table 2: Parameter Analysis using MLP

Table 3: Parameter Analysis using new Algorithm

Figure 2. TP rate, FP rate and Precision of J48, MLP and proposed algorithm.
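The ROC area compared in Figure 3 can be estimated without library support: sweep a threshold over the classifier's scores, collect the resulting (FP rate, TP rate) points, and integrate with the trapezoid rule. A minimal sketch with made-up scores (assuming distinct score values; ties would need to be grouped):

```python
def roc_area(scores, labels):
    """Trapezoidal ROC area for binary labels (1 = fault, 0 = normal),
    where a higher score means 'more likely a fault'."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sweep thresholds from high to low; each score yields one ROC point.
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoid rule over the (FP rate, TP rate) curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Perfectly separated scores give an area of 1.0; a classifier that ranks
# every normal instance above every fault scores 0.0.
auc = roc_area([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

Because the curve is built from ranks rather than a fixed decision threshold, the resulting area is independent of the class distribution, which is exactly the advantage of ROC analysis noted in Section 2.5.2.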

Figure 3. Recall, F-Measure and ROC area of J48, MLP and proposed algorithm.

4. CONCLUSION

A new model has been proposed and implemented in the WEKA environment, and its results have been compared with two existing techniques, J48 and MLP, with all three fed the same Ericson data set. The proposed model combines k-mean clustering, neural network training, and C4.5 classification. The results in the tables above show that the proposed tool yields a lower true-positive rate of 0.25 against 0.458 for J48 and 0.417 for MLP, and likewise a lower false-positive rate of 0.23 against 0.458 for J48 and 0.422 for MLP. As far as precision goes, the proposed tool (0.362) is comparable to MLP (0.365) and higher than J48 (0.21). The recall of the proposed tool (0.25) is far lower than that of both J48 (0.458) and MLP (0.417).

REFERENCES

Lazar, A., Wang, W. and Deng, R., Models and algorithms for network fault detection and identification: A review, Proceedings of IEEE ICC, Singapore, pp. 999-1003, November 1992.

Provost, F., Fawcett, T. and Kohavi, R., The case against accuracy estimation for comparing classifiers, Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1998.

Provost, F. and Fawcett, T., Robust classification for imprecise environments, Machine Learning, 42(3), pp. 203-231, 2001.

Flach, P. and Gamberger, D., Subgroup evaluation and decision support for a direct mailing marketing problem, Aspects of Data Mining, Decision Support and Meta-Learning, 2001. [Online] available from: http://www.informatik.unifreiburg.de/~ml/ecmlpkdd/ws-Proceedings/w04/paper8.pdf

Meira, D. M., A Model for Alarm Correlation in Telecommunications Networks, PhD thesis, Federal University of Minas Gerais, Belo Horizonte, Brazil, November 1997.

Thottan, M. and Ji, C., Proactive anomaly detection using distributed intelligent agents, IEEE Network, September/October 1998.

Hood, C. S. and Ji, C., Proactive network fault detection, Proceedings of IEEE INFOCOM, pp. 1139-1146, Kobe, Japan, April 1997.

Hajji, B. and Far, B. H., Continuous network monitoring for fast detection of performance problems, Proceedings of the 2001 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, July 2001.

Fawcett, T. and Provost, F., Adaptive fraud detection, Data Mining and Knowledge Discovery, 1(3), pp. 291-316, 1997.

Han, J., Altman, R. B., Kumar, V., Mannila, H. and Pregibon, D., Emerging scientific applications in data mining, Communications of the ACM, 45(8), pp. 54-58, 2002.

Hajji, B., Far, B. H. and Cheng, J., Detection of network faults and performance problems, Proceedings of the Internet Conference, Osaka, Japan, November 2001.