Clustering of Windows Security Events by means of Frequent Pattern Mining

Rosa Basagoiti¹, Urko Zurutuza¹, Asier Aztiria¹, Guzmán Santafé² and Mario Reyes²

¹ Mondragon University, Mondragon, Spain
{rbasagoiti;uzurutuza;aaztiria}@eps.mondragon.edu
² Grupo S21sec Gestión S.A., Orcoyen, Spain
{gsantafe;mreyes}@s21sec.com

Abstract. This paper summarizes the results obtained from applying Data Mining techniques to detect usual behaviors in the use of computers. To that end, two different clustering strategies have been developed, based on real security event logs. On the one hand, a clustering process has been carried out taking into account the characteristics that define the events in a quantitative way. On the other hand, an approach based on qualitative aspects has been developed, mainly based on the interruptions among security events. Both approaches have proven to be effective and complementary for clustering security audit trails of Windows systems and extracting useful behavior patterns.

Key words: Windows security event analysis, data mining, frequent pattern mining, intrusion detection, anomaly detection

1 Introduction

The idea of discovering behavioral patterns from a set of event logs in order to detect unusual behavior or malicious events is not novel. In fact, the idea came up in the 1980s when James P. Anderson, in a seminal work in the area of Intrusion Detection Systems [1], suggested that the common behavior of a user could be portrayed by analyzing the set of event logs generated during his/her use of a computer. Events falling outside such usual behavior could then be considered attacks, or at least unusual. There are many works along these lines, but most of them have been developed for Unix systems. This paper focuses on events produced by Windows operating systems. Analyzing such systems is even more complex due to the large amount of data they usually generate.
In this work, different experiments have been carried out considering two different approaches. On the one hand, we have created clusters based on characteristics which summarize the activity from a quantitative point of view. The second approach tries to find logical clusters by analyzing the interruptions among events.

The remainder of this paper is organized as follows. Section 2 provides a literature review of different tools and approaches for the analysis of log data. In Section 3 we analyse the nature of the problem and define some aspects to be considered. Section 4 describes the experiments and the results we have obtained. Finally, Section 5 provides some conclusions and ongoing challenges.

2 Related Work

Research in Intrusion Detection began in the 1980s when Anderson suggested that the normal behavior of a user could be characterized by analyzing his/her usual set of event logs. Since then, the area has attracted a significant number of researchers. The first application to detect unusual events or attacks was named IDES (Intrusion Detection Expert System) and it was developed by Dorothy Denning [2]. The basic idea of such a system was to monitor the normal activity in a mainframe and, based on those activities, define a set of rules which would allow the detection of anomalies. It is worth mentioning that the core of the problem remains the same today, while the complexity of the systems has increased considerably. Whereas Denning's approach analyzed the event logs of a single mainframe to which the users were connected, a current system is composed of many servers and workstations, each of which creates its own event logs.

More systems that used data mining algorithms on event logs were proposed, but all of them were based on centralized Unix events. In [3] a method for discovering temporal patterns in event sequences was proposed. Debar et al. proposed a system which could analyze the behavior of user activity using neural networks [4]. Neural networks were also used for anomaly detection based on Solaris BSM (Basic Security Module) audit data [5]. Lee and Stolfo [6] used audit data from Unix machines to create behavior patterns using association rules and frequent episode mining, so that a set of events occurring in a given time window could be discovered. In [7] Lane investigated the use of Hidden Markov Models for user pattern generation.
The main difference between these works and ours is the source of the event logs. Almost 90% of the population uses Windows systems, and the events are stored on each host, so the complexity of centralizing and analyzing this information increases significantly. Also, our approach focuses on discovering the behavior of the hosts, not of the users related to them. This way we do not focus only on usage patterns for intrusion detection, but on any anomalous behavior that could happen (e.g. misconfigurations).

In order to centralize all this information and make it easier to use, Security Information Management (SIM) tools have been developed. Currently, there are many applications developed with the purpose of detecting unusual behaviors. Tools such as Tripwire³, Samhain⁴, GFI EventsManager⁵ and especially OSSEC⁶ and Cisco MARS (Cisco Security Monitoring, Analysis, and Response System)⁷ are examples. Nevertheless, only GFI EventsManager, OSSEC and Cisco MARS can be used in Windows environments, and their analysis strategies need to be improved. These tools, except Cisco MARS, are mainly focused on monitoring modifications in configuration, administration actions, identification of system errors and suspicious security problems. None of them, however, can generate sequential models which allow the detection of unusual events.

In this sense, different approaches have tried to discover the correlation between events [8]. Some of them have even worked with summarized data [9]. Specific tools for mining event logs have also been developed [10]. Other options that have been studied are the use of time series mining techniques [11] or of techniques for mining frequent itemsets [12].

There is a clear need for a system which logically clusters the security event logs generated in Windows systems. Therefore, in the following sections we describe an approach to classify and correlate such events so that they can be used for further applications.

³ Tripwire: http://sourceforge.net/projects/tripwire/
⁴ Samhain: http://www.samhain.org/
⁵ GFI Events Manager: http://www.gfi.com/es/eventsmanager/
⁶ OSSEC: http://www.ossec.net/
⁷ Cisco MARS: http://www.cisco.com/en/us/products/ps6241/

3 Analysis of Windows security event logs

Windows classifies events into different categories that are stored in independent logs, such as the System log, Application log, DNS log and Security log. This paper focuses on the events stored in the Security log, such as session logons or changes of privileges. Security logging can be activated from the User Manager for Domains (NT) or through security policies (W2K, W2K3), and it is available in all versions of Windows Professional and Server.

Each event contains information such as the type of event, date and time, event source (the software that registered the event), category, the event that was produced (event ID), the user who produced it and the station where the event occurred. Finally, Windows defines nine different categories of security events:

- Account logon events: The authentication of a user from the point of view of the system. A single event of this type is not very meaningful, but many attempts in a short period of time can indicate scanning activity or a brute force attack.
- Account management: Activity related to the creation, management and deletion of individual user accounts or groups of users.
- Directory service access: Access to any object that has a System Access Control List (SACL).
- Logon events: User authentication activity coming from the local station as well as from the system that triggered the activity in a network.
- Object access: Access to the file system and to objects of the registry. It provides an easy-to-use tool to record changes in sensitive files.
- Policy changes: Changes in the access policy and some other modifications.
- Privilege use: Windows allows granular permissions to be defined for carrying out specific tasks.
- Process tracking: Detailed information about when a process starts and finishes or when programs are activated.
- System events: Information that affects the integrity of the system.

In this work, we consider events generated by four different Domain Controllers (DC) over four days. From this point on, these servers will be named Alfa, Beta, Gamma and Delta. Table 1 shows the number of events generated by each station each day. It is worth mentioning that the Gamma server generates many more events than the rest of the DCs. Moreover, the more events a system generates, the more complex their analysis becomes. That is why data mining techniques seem a promising approach for this type of data.

Table 1. Number of events to be analysed in the experiment

Server   Day 1       Day 2       Day 3       Day 4       Total
Gamma    4,811,036   2,957,733   3,767,927   1,085,619   12,622,315
Beta       499,589     881,758     876,110     895,249    3,152,706
Delta           77          66          78         105          326
Alfa     1,565,283   1,492,202   1,540,150   1,996,107    6,593,742

4 Clustering Event Sources

In this section we describe the experiments carried out using Windows event logs. For that, we have followed the usual steps suggested in any Data Mining process [13].

4.1 Learning the application domain

The event logs have some special features that have to be taken into account in the clustering process. For that, the dataset is first analyzed by extracting descriptive statistics of each attribute. These statistics only show the number and the percentage of different values for each attribute. The usefulness of each attribute was judged by the distribution of its values: all attributes where more than 80% of the events shared the same value were ruled out. Attributes that were statistically dependent on another attribute were ruled out too (for instance Message vs EventID). After analyzing the data we realized that although there were 22,369,089 events, the number of different types of events (different EventIDs) was 28. We decided to analyze the events generated by each server, ruling out all the attributes except Workstation name, Event ID, User ID and Timestamp.
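
The 80% rule described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy events and the attribute names `Source` and `User` are hypothetical, and only the 80% threshold comes from the text.

```python
from collections import Counter


def dominant_share(values):
    """Return the fraction of events that carry the most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def select_attributes(events, threshold=0.8):
    """Keep attributes whose most common value covers at most `threshold` of the events."""
    kept = []
    for name in events[0]:
        share = dominant_share([event[name] for event in events])
        if share <= threshold:
            kept.append(name)
    return kept


# Hypothetical toy events; real logs carry many more attributes.
events = [
    {"EventID": 540, "Source": "Security", "User": "u1"},
    {"EventID": 538, "Source": "Security", "User": "u2"},
    {"EventID": 540, "Source": "Security", "User": "u1"},
    {"EventID": 576, "Source": "Security", "User": "u3"},
    {"EventID": 538, "Source": "Security", "User": "u2"},
]

# "Source" has the same value in every event, so it is ruled out.
print(select_attributes(events))
```

The dependence check between attributes (Message vs EventID) is a separate step and is not covered by this sketch.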

4.2 Feature Selection

The attribute Event ID is the key feature for the analysis: the statistics used as input are computed with respect to it. This step of the process is critical and may directly influence the results we obtain. The statistics proposed are those indicators that might be key to expressing computer behavior based on security event logs. After analyzing the information, the following features were identified in order to cluster sources of Windows logs.

1. Total number of events (num_id)
2. Number of different types of events (num_diff_id)
3. Number of different users (num_diff_user)
4. Most frequent event (freq_event_1)
5. Second most frequent event (freq_event_2)
6. Percentage of events equal to the most frequent event (perc_event_1)
7. Percentage of events equal to the second most frequent event (perc_event_2)
8. Most frequent event in the most active second (freq_event_max_sec)
9. Most frequent event in the most active minute (freq_event_max_min)
10. Event forming the longest sequence of the same event (long_event_id)
11. Length of the longest sequence of the same event (long_event_num)

4.3 Application of clustering techniques

Once the attributes have been selected, two different clustering processes have been carried out.

Clustering of statistical data using K-means. Clustering is a data mining technique which groups similar instances based on the similarities of their attributes. The basic idea is to minimize the distance between the instances of the same cluster and maximize the distance between different clusters. There are many different clustering techniques, such as hierarchical clustering or partitional clustering. In our case, the simplest approach (K-means) seems to be enough. One particularity of K-means is that the number of clusters to discover must be given in advance.
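
As an illustration of the eleven features listed in Section 4.2, the sketch below computes them for the events of one server on one day. It is only one possible reading of the feature definitions: the tuple layout `(timestamp_seconds, event_id, user)`, the function name and the toy data are assumptions, and ties are broken by first occurrence.

```python
from collections import Counter
from itertools import groupby


def source_features(events):
    """events: time-ordered list of (timestamp_seconds, event_id, user) tuples."""
    ids = [event_id for _, event_id, _ in events]
    ranked = Counter(ids).most_common(2)
    first = ranked[0]
    second = ranked[1] if len(ranked) > 1 else (None, 0)

    def top_event_in_busiest(bucket_seconds):
        # Find the busiest time bucket, then the most frequent event inside it.
        per_bucket = Counter(t // bucket_seconds for t, _, _ in events)
        busiest = per_bucket.most_common(1)[0][0]
        inside = Counter(i for t, i, _ in events if t // bucket_seconds == busiest)
        return inside.most_common(1)[0][0]

    # Runs of identical consecutive events, as (event_id, run_length).
    runs = [(event_id, sum(1 for _ in group)) for event_id, group in groupby(ids)]
    long_event_id, long_event_num = max(runs, key=lambda run: run[1])

    return {
        "num_id": len(ids),
        "num_diff_id": len(set(ids)),
        "num_diff_user": len({user for _, _, user in events}),
        "freq_event_1": first[0],
        "freq_event_2": second[0],
        "perc_event_1": 100.0 * first[1] / len(ids),
        "perc_event_2": 100.0 * second[1] / len(ids),
        "freq_event_max_sec": top_event_in_busiest(1),
        "freq_event_max_min": top_event_in_busiest(60),
        "long_event_id": long_event_id,
        "long_event_num": long_event_num,
    }


# Hypothetical toy log for one server and one day.
events = [
    (0, 540, "a"), (0, 540, "a"), (0, 540, "b"),
    (1, 538, "a"), (61, 576, "c"), (61, 576, "c"),
]
features = source_features(events)
print(features)
```

Each server and day then yields one such feature vector, which gives the 16 instances that are clustered next.
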
In this work, since the aim is to obtain patterns of the different machines, this constant is known: 4 in our case. The K-means technique [14] selects K points as initial centroids of the clusters. Then it assigns all instances to the closest centroid and re-computes the centroid of each cluster. This process is repeated until the centroids of the clusters remain in the same position. We have applied this technique to the statistics computed from the events summarized in Table 1. We know in advance that the first four instances correspond to the events that occurred over the four days on the station named Alfa, the following four instances correspond to the Beta station, and so on. The application of the K-means technique on the selected attributes (num_id, num_diff_id and long_event_num in our case) produced four clusters, which match the four servers analyzed.
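
The assign-and-recompute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' setup: the Euclidean distance, the explicit initial centroids, the function name and the two-dimensional toy points are all assumptions, and the text does not say whether the features were scaled beforehand.

```python
import math


def kmeans(points, k, initial, max_iter=100):
    """Plain K-means: returns (labels, centroids) for the given initial centroids."""
    centroids = [list(c) for c in initial]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # centroids did not move: converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical, already scaled feature vectors: two instances per source.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0), (9.2, 1.1)]
labels, centroids = kmeans(points, k=3, initial=[points[0], points[2], points[4]])
print(labels)
```

With raw counts such as num_id, which range from hundreds to millions across servers, the largest feature would dominate a Euclidean distance, so some normalization would probably be needed in practice.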

Discovering frequent event sequences. So far, we have treated the events as independent and analyzed them from a statistical point of view. The events we are considering in this work are the following:

538: User Logoff
540: Successful Network Logon
576: Special privileges assigned to new logon
578: Privileged object operation

If we order the events by their timestamps, we get a sequence of events, which can be analyzed in different ways. This second approach mainly focuses on the analysis of the 16 different sequences generated by the 4 DCs during 4 days. A sequence of events is a series of nominal symbols which indicates the occurrence of different events. Our work has focused on analysing which events usually interrupt previous events. Let us consider that the system has recorded the following sequence:

540 540 540 538

We could say that in this case the event 540 (Successful Network Logon) has been interrupted by the event 538 (User Logoff). In that sense, we have considered all the possible interruptions: since we are considering 28 different events, we have generated a 28 x 28 matrix. In that matrix we store how many times an event has interrupted a previous event. Let us consider the example depicted in Figure 1. It means that the event 540 has been interrupted by the event 538 2500 times.

Fig. 1. Interruptions matrix

The content of such a matrix is represented by means of an array, where the first 28 values define the interruptions of the first event (in this case the event 538, User Logoff). Thus, the first value indicates how many times the 538 event is interrupted by itself (which we take to be 0), the second one how many times it is interrupted by the event 540 (Successful Network Logon), and so on.
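
The interruption matrix and its array form can be sketched as follows. This is one reading of the description above, under the assumption that an interruption is counted whenever an event differs from the event immediately before it; the function names are hypothetical, and only four event IDs are used instead of the 28 of the experiment.

```python
def interruption_matrix(sequence, event_ids):
    """Count how often each event (row) is interrupted by another event (column)."""
    index = {event_id: i for i, event_id in enumerate(event_ids)}
    size = len(event_ids)
    matrix = [[0] * size for _ in range(size)]
    for previous, current in zip(sequence, sequence[1:]):
        if previous != current:  # a repeated event does not interrupt itself
            matrix[index[previous]][index[current]] += 1
    return matrix


def flatten(matrix):
    """Row-major array: the first len(matrix) values belong to the first event."""
    return [value for row in matrix for value in row]


event_ids = [538, 540, 576, 578]
sequence = [540, 540, 540, 538]  # the example sequence from the text

matrix = interruption_matrix(sequence, event_ids)
series = flatten(matrix)
print(matrix[1][0])  # times 540 was interrupted by 538
print(series)
```

With all 28 event IDs, the flattened array has 28 x 28 = 784 values per server and day.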

After representing such values in an array, we plotted them, so that the plot Alfa1 shows the interruptions for the Alfa server on the first day, Alfa2 shows the interruptions of the same server on the second day, and so on. The following figures show the series obtained for the stations Alfa and Beta on the first two days.

Fig. 2. Day 1 and 2 of Alfa server

Fig. 3. Day 1 and 2 of Beta server

Looking at the figures, we realized that the results for a particular server on different days were very similar. Moreover, the dissimilarities with the rest of the servers could facilitate the clustering process. Thus, taking as a starting point the 16 series (Alfa1, Alfa2, Alfa3, Alfa4, Beta1, Beta2, Beta3, Beta4, Gamma1, Gamma2, Gamma3, Gamma4, Delta1, Delta2, Delta3, Delta4), we carried out a clustering process using the K-means technique again. In order to compare and therefore cluster the interruptions, we need a criterion to measure similarity. Let us consider these two sets of interruptions X and Y:

$$X = X_1, X_2, X_3, \ldots, X_n \qquad (1)$$

$$Y = Y_1, Y_2, Y_3, \ldots, Y_n \qquad (2)$$

The similarity between sets of interruptions is given by the Manhattan distance between them, D(X, Y):

$$D(X, Y) = \sum_{i=1}^{n} |X_i - Y_i| \qquad (3)$$

Table 2 shows the results of the clustering process. 15 out of 16 series were correctly grouped; only one series of the Gamma DC was misclassified.

Table 2. Clustering of frequent event sequences

Series number   Name of the series   Assigned cluster
 1              Alfa 1               2
 2              Alfa 2               2
 3              Alfa 3               2
 4              Alfa 4               2
 5              Beta 1               4
 6              Beta 2               4
 7              Beta 3               4
 8              Beta 4               4
 9              Gamma 1              1
10              Gamma 2              1
11              Gamma 3              1
12              Gamma 4              2
13              Delta 1              3
14              Delta 2              3
15              Delta 3              3
16              Delta 4              3

5 Conclusions and ongoing challenges

Discovering frequent patterns in event logs is the first step towards detecting unusual behavior or anomalies. Besides proving that it is possible to detect patterns in event logs, the experiments have shown that different servers have different patterns and that these can be found and identified, even in Windows systems. The experiments carried out at different stages have also shown that the same server has very similar patterns on different days. That said, these experiments have been carried out with only a few Domain Controllers, so it would be interesting to validate the approach with a larger set of servers and workstations. Finally, it is worth saying that these results are work in progress that aims to detect anomalies in security event logs by analyzing the event sources.

References

1. Anderson, J.P.: Computer Security Threat Monitoring and Surveillance. Technical report, Fort Washington (1980)
2. Denning, D.E.: An Intrusion-Detection Model. IEEE Transactions on Software Engineering, 13(2):222-232 (1987)
3. Teng, H., Chen, K., Lu, S.: Adaptive real-time anomaly detection using inductively generated sequential patterns. Proceedings of 1990 IEEE Computer Society Symposium on Research in Security and Privacy, Oakland, California, pp. 278-84, May 7-9 (1990)
4. Debar, H., Becker, M., Siboni, D.: A Neural Network Component for an Intrusion Detection System. Proceedings, IEEE Symposium on Research in Computer Security and Privacy, pp. 240-250 (1992)
5. Endler, D.: Intrusion detection: Applying machine learning to Solaris audit data. In Proceedings of the 1998 Annual Computer Security Applications Conference (ACSAC 98), pp. 268-279, Scottsdale, AZ, December 1998. IEEE Computer Society Press, Los Alamitos, CA (1998)
6. Lee, W., Stolfo, S.: Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY 98), San Antonio, TX, January (1998)
7. Lane, T., Brodley, C.E.: Temporal Sequence Learning and Data Reduction for Anomaly Detection. ACM Transactions on Information and System Security, 2:295-331 (1999)
8. Larosa, C., Xiong, L., Mandelberg, K.: Frequent pattern mining for kernel trace data. SAC 08: Proceedings of the 2008 ACM Symposium on Applied Computing, pp. 880-885, Brazil (2008)
9. Rana, A.Z., Bell, J.: Using event attribute name-value pairs for summarizing log data. AusCERT2007 (2007)
10. Vaarandi, R.: Mining Event Logs with SLCT and LogHound. Proceedings of the 2008 IEEE/IFIP Network Operations and Management Symposium, pp. 1071-1074 (2008)
11. Viinikka, J.: Time series modeling for IDS Alert Management. ACM ASIAN Symposium on Information (2006)
12. Burdick, D., Calimlim, M., Gehrke, J.: A maximal frequent itemset algorithm for transactional databases. IEEE Trans. Knowl. Data Eng. 17(11):1490-1504 (2005)
13. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34 (1996)
14. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, University of California Press (1967)