Clustering of Windows Security Events by means of Frequent Pattern Mining

Rosa Basagoiti 1, Urko Zurutuza 1, Asier Aztiria 1, Guzmán Santafé 2 and Mario Reyes 2

1 Mondragon University, Mondragon, Spain
{rbasagoiti;uzurutuza;aaztiria}@eps.mondragon.edu
2 Grupo S21sec Gestión S.A., Orcoyen, Spain
{gsantafe;mreyes}@s21sec.com

Abstract. This paper summarizes the results obtained from applying Data Mining techniques to detect usual behaviors in the use of computers. Based on real security event logs, two different clustering strategies have been developed. On the one hand, a clustering process has been carried out taking into account the characteristics that define the events in a quantitative way. On the other hand, an approach based on qualitative aspects has been developed, mainly based on the interruptions among security events. Both approaches have proven effective and complementary for clustering security audit trails of Windows systems and extracting useful behavior patterns.

Key words: Windows security event analysis, data mining, frequent pattern mining, intrusion detection, anomaly detection

1 Introduction

The idea of discovering behavioral patterns from a set of event logs in order to detect unusual behavior or malicious events is not novel. In fact, the idea came up in the 80s when James P. Anderson, in a seminal work in the area of Intrusion Detection Systems [1], suggested that the common behavior of a user could be characterized by analyzing the set of event logs generated during his/her use of a computer. Events falling outside this usual behavior could then be considered attacks, or at least anomalous. There are many works along this line, but most of them have been developed for Unix systems. This paper focuses on events produced by Windows operating systems. The analysis of such systems is even more complex due to the large amount of data they usually generate.
In this work, different experiments have been carried out considering two different approaches. On the one hand, we have created clusters based on characteristics that summarize the activity from a quantitative point of view. The second approach tries to find logical clusters by analyzing the interruptions among events.

The remainder of this paper is organized as follows. Section 2 provides a literature review of different tools and approaches for the analysis
of log data. In Section 3 we analyze the nature of the problem and define some aspects to be considered. Section 4 describes the experiments and the results obtained. Finally, Section 5 provides some conclusions and ongoing challenges.

2 Related Work

The research in Intrusion Detection began in the 1980s, when Anderson suggested that the normal behavior of a user could be characterized by analyzing his/her usual set of event logs. Since then, the area has attracted a significant number of researchers. The first application to detect unusual events or attacks was named IDES (Intrusion Detection Expert System), and it was developed by Dorothy Denning [2]. The basic idea of the system was to monitor the normal activity on a mainframe and, based on those activities, define a set of rules which would allow the detection of anomalies. It is worth mentioning that not only does the core of the problem remain the same today, but the complexity of the systems has increased considerably. Whereas Denning's approach suggested analyzing the event logs of the mainframe users were connected to, a current system is composed of many servers and workstations, each creating its own event logs. Further systems that used data mining algorithms on event logs were proposed, but all of them were based on centralized Unix events. In [3] a method for discovering temporal patterns in event sequences was proposed. Debar et al. proposed a system which could analyze the behavior of user activity using neural networks [4]. Neural networks were also used for anomaly detection based on Solaris BSM (Basic Security Module) audit data [5]. Lee and Stolfo [6] used audit data from Unix machines to create behavior patterns using association rules and frequent episode mining; this way, a set of events that occurred within a given time window could be discovered. In [7], Lane investigated the use of Hidden Markov Models for user pattern generation.
The source of the event logs is the main difference between those works and ours. Almost 90% of the population uses Windows systems, where the events are stored on each host, so the complexity of centralizing and analyzing this information increases significantly. Also, our approach focuses on discovering the behavior of the hosts, not of the users related to them. This way, we do not focus only on usage patterns for intrusion detection, but on any anomalous behavior that could happen (e.g. misconfigurations). In order to centralize all this information and ease its use, Security Information Management (SIM) tools have been developed. Currently, there are many applications developed with the purpose of detecting unusual behaviors. Tools such as Tripwire 3, Samhain 4, GFI EventsManager 5,

3 Tripwire: http://sourceforge.net/projects/tripwire/
4 Samhain: http://www.samhain.org/
5 GFI EventsManager: http://www.gfi.com/es/eventsmanager/
and especially OSSEC 6 and Cisco MARS (Cisco Security Monitoring, Analysis, and Response System) 7 are examples of this. Nevertheless, only GFI EventsManager, OSSEC and Cisco MARS can be used in Windows environments, and their analysis strategies need to be improved. These tools, except Cisco MARS, are mainly focused on monitoring configuration modifications, administration actions, identification of system errors and suspicious security problems. However, none of them is able to generate sequential models which allow detecting unusual events. In this sense, different approaches have tried to discover the correlation between events [8], and some have worked with summarized data [9]. Specific tools for mining event logs have also been developed [10]. Other options that have been studied are the use of techniques from temporal series mining [11] or the mining of frequent itemsets [12]. There is a clear need for a system which logically clusters the security event logs generated in Windows systems. Therefore, in the following sections we describe an approach to classify and correlate such events so that they can be used for further applications.

3 Analysis of Windows security event logs

Windows classifies the events in different categories that are stored in independent logs, such as the System log, Application log, DNS log and Security log. This paper focuses on the events stored in the security log, such as session logons or changes of privileges. Security auditing can be activated from the domain user administration (NT) or through security policies (W2K, W2K3), and it is available in all versions of Windows Professional and Server.
Each event contains information such as the type of event, date and time, event source (the software that registered the event), category, the event that was produced (event ID), the user who produced it and the station where it occurred. Windows allows defining nine different categories of security events.

Account logon events: This event type reflects the authentication of a user from the point of view of the system. A single event of this type is not very meaningful, but many attempts in a short period of time can indicate scanning activity or a brute force attack.

Account management: Activity related to the creation, management and deletion of individual user accounts or groups of users.

Directory service access: Access to any object that contains a System Access Control List (SACL).

Logon events: User authentication activity, both from the local station and from the system that triggered the activity over the network.

Object access: Access to the file system and registry objects. It provides an easy-to-use means to register changes to sensitive files.

6 OSSEC: http://www.ossec.net/
7 Cisco MARS: http://www.cisco.com/en/us/products/ps6241/
Policy changes: Changes in the access policy and some other modifications.

Privilege use: Windows allows defining granular permissions to carry out specific tasks.

Process tracking: It generates detailed information about when a process starts and finishes, or when programs are activated.

System events: It registers information that affects the integrity of the system.

In this work, we consider events generated by 4 different Domain Controllers (DC) during 4 days. From this point on, these servers will be named Alfa, Beta, Gamma and Delta. Table 1 shows the number of events generated by each station each day. It is worth mentioning that the Gamma server generates many more events than the rest of the DCs. Moreover, the more events a system generates, the more complex their analysis becomes. That is why data mining techniques seem a promising approach for this type of data.

Table 1. Number of events to be analyzed in the experiment

         Day 1      Day 2      Day 3      Day 4      Total
Gamma    4,811,036  2,957,733  3,767,927  1,085,619  12,622,315
Beta     499,589    881,758    876,110    895,249    3,152,706
Delta    77         66         78         105        326
Alfa     1,565,283  1,492,202  1,540,150  1,996,107  6,593,742

4 Clustering Event Sources

In this section we describe the experiments carried out using Windows event logs. We have followed the usual steps suggested in any Data Mining process [13].

4.1 Learning the application domain

The event logs have some special features that have to be taken into account in the clustering process. First, the dataset was analyzed, extracting descriptive statistics for each attribute: the number and the percentage of distinct values. The usefulness of each attribute was judged by the distribution of its values. All attributes where more than 80% of the events shared the same value were ruled out.
Attributes that were statistically dependent on another attribute were ruled out too (for instance, Message vs. EventID). After analyzing the data, we realized that although there were 22,369,089 events, the number of distinct event types (distinct EventIDs) was only 28. We decided to analyze the events generated by each server, ruling out all attributes except Workstation name, Event ID, User ID and Timestamp.
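The attribute-filtering rule just described can be sketched in a few lines of Python. This is a minimal stdlib-only illustration, not the original experiment's code: the attribute names and toy events are invented for the example.

```python
from collections import Counter

def dominant_value_ratio(values):
    """Fraction of rows that share the single most frequent value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

def select_attributes(events, threshold=0.8):
    """Keep only attributes whose most frequent value covers at most
    `threshold` of the events, i.e. rule out near-constant attributes."""
    attributes = list(events[0].keys())
    return [a for a in attributes
            if dominant_value_ratio([e[a] for e in events]) <= threshold]

# Toy data: 'source' has the same value in 100% of events, so it is ruled out.
events = [
    {"event_id": 540, "source": "Security", "user": "alice"},
    {"event_id": 538, "source": "Security", "user": "bob"},
    {"event_id": 540, "source": "Security", "user": "alice"},
    {"event_id": 576, "source": "Security", "user": "carol"},
    {"event_id": 540, "source": "Security", "user": "alice"},
]
print(select_attributes(events))  # → ['event_id', 'user']
```

A real implementation would also apply the dependency check (e.g. dropping Message because it is determined by EventID), which requires comparing attribute pairs rather than single-attribute distributions.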
4.2 Feature Selection

The attribute Event ID is the key feature when carrying out the analysis: the statistics that are used as input are classified based on this feature. This step of the process is critical and may directly influence the results. The statistics are proposed as indicators that might be key to expressing computer behavior based on security event logs. After analyzing the information, the following features were identified in order to cluster sources of Windows logs:

1. Number of total events (num id)
2. Number of different types of events (num diff id)
3. Number of different users (num diff user)
4. Most frequent event (freq event 1)
5. Second most frequent event (freq event 2)
6. Percentage of events equal to the most frequent event (perc event 1)
7. Percentage of events equal to the second most frequent event (perc event 2)
8. Most frequent event in the most active second (freq event max sec)
9. Most frequent event in the most active minute (freq event max min)
10. Event forming the longest run of identical events (long event id)
11. Length of the longest run of identical events (long event num)

4.3 Application of clustering techniques

Once the attributes have been selected, two different clustering processes have been carried out.

Clustering of statistical data using K-means. Clustering is a data mining technique which groups similar instances based on the similarities of their attributes. The basic idea is to minimize the distance between instances of the same cluster and maximize the distance between different clusters. There are many clustering techniques, such as hierarchical or partitional clustering. In our case, the simplest approach (K-means) seems to be enough. One particularity of K-means is that the number of clusters to discover must be given in advance.
In this work, with the aim of obtaining patterns of the different machines, this constant is known: 4 in our case. The K-means technique [14] selects K points as initial centroids of the clusters. It then assigns every instance to the closest centroid and re-computes the centroid of each cluster. This process is repeated until the centroids of the clusters no longer move. We have applied this technique to the data collected from the different events summarized in Table 1. We know in advance that the first four instances belong to events that occurred during four days on the station named Alfa, the following four instances belong to the Beta station, and so on. The application of the K-means technique on the selected attributes (num id, num diff id and long event num in our case) produced four clusters, which match the four servers analyzed.
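The K-means loop described above can be sketched as follows. This is a stdlib-only sketch, not the original experiment's code: the feature vectors are invented stand-ins for the (num id, num diff id, long event num) statistics, and we use a deterministic farthest-first seeding instead of the usual random initialization so the toy run is reproducible.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=100):
    """Plain K-means: assign each point to its nearest centroid,
    recompute the centroids, repeat until the assignment stabilizes."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(iters):
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:      # centroids no longer move
            break
        assignment = new_assignment
        for c in range(k):                    # recompute each centroid as the mean
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment

# Hypothetical (num_id, num_diff_id, long_event_num) vectors: four
# well-separated groups standing in for the four servers' daily statistics.
points = [(4.8e6, 20, 900), (4.7e6, 21, 880),   # "Gamma"-like days
          (5.0e5, 12, 300), (5.5e5, 13, 310),   # "Beta"-like days
          (80.0, 4, 3),     (70.0, 5, 4),       # "Delta"-like days
          (1.5e6, 15, 500), (1.6e6, 14, 520)]   # "Alfa"-like days
labels = kmeans(points, k=4)
```

On this toy data the two days of each "server" land in the same cluster, mirroring the result reported above for the real statistics of Table 1.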
Discovering frequent event sequences. So far, we have considered the events as independent and analyzed them from a statistical point of view. Among the events considered in this work are the following:

538: User Logoff
540: Successful Network Logon
576: Special privileges assigned to new logon
578: Privileged object operation

If we order the events by their timestamps, we get a sequence of events which can be analyzed in different ways. This second approach focuses on the analysis of the 16 different sequences generated by the 4 DCs during the 4 days. A sequence of events is a series of nominal symbols indicating the occurrence of different events. Our work has focused on analyzing which events usually interrupt previous events. Let us consider that the system has recorded the following sequence:

540 540 540 538

We could say that in this case the event 540 (Successful Network Logon) has been interrupted by the event 538 (User Logoff). We have considered all possible interruptions, so that, taking into account the 28 different events, we have generated a 28 x 28 matrix. In this matrix we store how many times each event has interrupted each previous event. In the example depicted in Figure 1, the event 540 has been interrupted by the event 538 2,500 times.

Fig. 1. Interruptions matrix

The content of such a matrix is represented by means of an array, where the first 28 values define the interruptions of the first event (in this case the event 538, User Logoff). Thus, the first value indicates how many times the 538 event is interrupted by itself (which we set to 0), the second one how many times it is interrupted by the event 540 (Successful Network Logon), and so on.
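The matrix construction and its array representation can be sketched as follows. This is a minimal stdlib-only illustration over 4 of the 28 event types; the toy sequence is invented for the example. The Manhattan distance used later to compare the resulting series is included as well.

```python
from itertools import groupby

EVENT_IDS = [538, 540, 576, 578]          # the real study tracks 28 event types
INDEX = {e: i for i, e in enumerate(EVENT_IDS)}

def interruption_matrix(sequence, ids=EVENT_IDS):
    """m[a][b] counts how often a run of event a is interrupted
    (i.e. immediately followed) by a different event b."""
    n = len(ids)
    m = [[0] * n for _ in range(n)]
    runs = [key for key, _ in groupby(sequence)]   # collapse repeated events
    for prev, nxt in zip(runs, runs[1:]):
        m[INDEX[prev]][INDEX[nxt]] += 1
    return m

def flatten(m):
    """Row-major array: the first row lists the interruptions of event 538."""
    return [v for row in m for v in row]

def manhattan(x, y):
    """Manhattan distance between two interruption series."""
    return sum(abs(a - b) for a, b in zip(x, y))

seq = [540, 540, 540, 538, 540, 576, 540, 540, 538]
m = interruption_matrix(seq)
# In this toy sequence, 540 was interrupted by 538 twice:
# m[INDEX[540]][INDEX[538]] == 2
```

Note that consecutive repeats of the same event are collapsed into a single run, so the diagonal of the matrix stays at 0, matching the convention described above.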
After representing these values in an array, we depicted them in plots, where the plot Alfa1 shows the interruptions for the Alfa server on the first day, Alfa2 shows the interruptions of the same server on the second day, and so on. The following figures show the series obtained for the stations Alfa and Beta on the first two days.

Fig. 2. Day 1 and 2 of Alfa server

Fig. 3. Day 1 and 2 of Beta server

Looking at the figures, we realized that the results for a particular server on different days were very similar. Moreover, the dissimilarities with the rest of the servers could facilitate the clustering process. Thus, taking as starting point the 16 series (Alfa1, Alfa2, Alfa3, Alfa4, Beta1, Beta2, Beta3, Beta4, Gamma1, Gamma2, Gamma3, Gamma4, Delta1, Delta2, Delta3, Delta4), we carried out a clustering process using again the K-means technique. In order to compare, and therefore cluster, the interruption series, we need a criterion to measure their similarity. Let us consider two sets of interruptions, X and Y:
X = X1, X2, X3, ..., Xn    (1)

Y = Y1, Y2, Y3, ..., Yn    (2)

The similarity between two sets of interruptions is given by the Manhattan distance between them, D(X, Y):

D(X, Y) = Σi=1..n |Xi − Yi|    (3)

Table 2 shows the results of the clustering process. 15 out of 16 series were correctly classified, with only one series of the Gamma DC misclassified.

Table 2. Clustering of frequent event sequences

Series number   Name of the series   Assigned cluster
1               Alfa 1               2
2               Alfa 2               2
3               Alfa 3               2
4               Alfa 4               2
5               Beta 1               4
6               Beta 2               4
7               Beta 3               4
8               Beta 4               4
9               Gamma 1              1
10              Gamma 2              1
11              Gamma 3              1
12              Gamma 4              2
13              Delta 1              3
14              Delta 2              3
15              Delta 3              3
16              Delta 4              3

5 Conclusions and ongoing challenges

Discovering frequent patterns in event logs is the first step towards detecting unusual behavior or anomalies. Besides proving that it is possible to detect patterns in event logs, the experiments have shown that different servers have different patterns which can be found and identified, even in Windows systems. The experiments carried out at different stages have also shown that the same server exhibits very similar patterns on different days. In that sense, these
experiments have been carried out with few Domain Controllers, so it would be interesting to validate the approach with a larger set of servers and workstations. Finally, it is worth saying that these results are work in progress that aims to detect anomalies in security event logs by analyzing the event sources.

References

1. Anderson, J.P.: Computer Security Threat Monitoring and Surveillance. Technical report, Fort Washington (1980)
2. Denning, D.E.: An Intrusion-Detection Model. IEEE Transactions on Software Engineering, 13(2):222-232 (1987)
3. Teng, H., Chen, K., Lu, S.: Adaptive real-time anomaly detection using inductively generated sequential patterns. In Proceedings of the 1990 IEEE Computer Society Symposium on Research in Security and Privacy, Oakland, California, pp. 278-284, May 7-9 (1990)
4. Debar, H., Becker, M., Siboni, D.: A Neural Network Component for an Intrusion Detection System. In Proceedings of the IEEE Symposium on Research in Computer Security and Privacy, pp. 240-250 (1992)
5. Endler, D.: Intrusion detection: Applying machine learning to Solaris audit data. In Proceedings of the 1998 Annual Computer Security Applications Conference (ACSAC '98), pp. 268-279, Scottsdale, AZ, December 1998. IEEE Computer Society Press (1998)
6. Lee, W., Stolfo, S.: Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January (1998)
7. Lane, T., Brodley, C.E.: Temporal Sequence Learning and Data Reduction for Anomaly Detection. ACM Transactions on Information and System Security, 2:295-331 (1999)
8. Larosa, C., Xiong, L., Mandelberg, K.: Frequent pattern mining for kernel trace data. In SAC '08: Proceedings of the 2008 ACM Symposium on Applied Computing, pp. 880-885, Brazil (2008)
9. Rana, A.Z., Bell, J.: Using event attribute name-value pairs for summarizing log data. AusCERT2007 (2007)
10.
Vaarandi, R.: Mining Event Logs with SLCT and LogHound. In Proceedings of the 2008 IEEE/IFIP Network Operations and Management Symposium, pp. 1071-1074 (2008)
11. Viinikka, J.: Time series modeling for IDS Alert Management. ACM ASIAN Symposium on Information (2006)
12. Burdick, D., Calimlim, M., Gehrke, J.: A maximal frequent itemset algorithm for transactional databases. IEEE Transactions on Knowledge and Data Engineering, 17(11):1490-1504 (2005)
13. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34 (1996)
14. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, University of California Press (1967)