A Comparison of Word Frequency and N-Gram Based Vulnerability Categorization Using SOM

Melanie Tupper
Supervised by Nur Zincir-Heywood
Faculty of Computer Science, Dalhousie University
October 2008

Abstract: Network attackers exploit software vulnerabilities on network computers to facilitate successful attacks. Many organizations keep track of the existing software vulnerabilities in the form of vulnerability databases. However, categorizing vulnerabilities is difficult due to the large number of different attributes maintained. In this work we apply a data clustering algorithm (SOM) to two different representations of the information contained in an existing online vulnerability database. After identifying the more valuable approach for this task, we are able to identify critical vulnerability features inherent in the dataset.

Acknowledgements

I would like to thank The Computer Research Association's Committee on the Status of Women in Computing Research (CRA-W) and The Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research. I would like to thank my mentor, Dr Nur Zincir-Heywood, for her inspiration, and my husband, Stewart Hardie, for his encouragement, love, and support.

Table of Contents

Section 1 - Introduction
1.1 Motivation
1.2 Overview
Section 2 - Related Work
Section 3 - Methodology
3.1 Data Preprocessing
3.2 SOM Toolkit
Section 4 - Results
Section 5 - Conclusions and Future Work
References
Appendix A: Stop Words
Appendix B: Unlabeled U-Matrix Representations
Appendix C: Labeled U-Matrix Representation

Section 1 - Introduction

1.1 Motivation

Network attackers exploit software vulnerabilities on network computers to facilitate successful attacks. There are many types of software vulnerabilities, which differ in various attributes, including the level of authorization needed for execution, the impact on the target network, the complexity of the exploit, and others. Without software vulnerabilities, attacks would not be possible. Therefore, understanding these vulnerabilities holds the key to configuring secure computer networks.

Many organizations keep track of the various existing software vulnerabilities in the form of vulnerability databases. These databases maintain a wealth of information in connection with specific vulnerabilities. Data collected may include when the vulnerability first appeared, which programs or operating systems are affected, what effects a successful exploit would have on the target network, and whether a patch for the vulnerable software is available. However, each database may record different vulnerability attributes, and categorizing vulnerabilities is difficult due to the large number of different attributes maintained. By applying SOM, a data clustering algorithm, to an online vulnerability database, we will be able to identify critical vulnerability features inherent in the data. Furthermore, analysis of these features will allow us to propose a standardization strategy for vulnerability classification in the future.

1.2 Overview

The following section, Section 2, provides a brief introduction to SOM and n-grams, as well as a review of the current state of vulnerability classification. Section 3 describes each stage of our research and Section 4 summarizes our results. We present our conclusions and offer suggestions for furthering this research in Section 5.

Section 2 - Related Work

Data clustering is a classification technique used to group data into subsets based on data commonalities. A Self-Organizing Feature Map (SOM), also commonly referred to as an unsupervised learning network, is a data clustering algorithm that groups the data according to features or categories that are inherent in the dataset [1]. Recent research efforts utilizing the SOM algorithm have presented a wide variety of applications, including document classification [2] and computer network attack behavior categorization [3].

Vulnerability classification is a research area that is currently attracting much attention due to the potential benefits of a standardized vulnerability classification scheme. Li et al. [4] propose a system for standardizing vulnerability categories as a result of applying the SOM algorithm to a word frequency vector representing the text.

While word frequency vectors are often used as input to data clustering algorithms, another approach is proving useful in areas of text categorization: n-grams. An n-gram is defined to be a smaller sequence of n items from a larger sequence [5]. In the case of text, items can refer to either characters or words. In our application, we will use n-grams of words. For example, 'ceramics collected by' is a 3-gram appearing in the Google n-gram corpus; each word is considered to be an item. Recent works that have used n-gram vectors as input to the SOM algorithm include a comparison of text clustering [6] and categorization [7].

In this work we compare word vector and n-gram approaches as applied to the problem of classifying vulnerabilities using the SOM algorithm. As mentioned above, there is currently no standardized vulnerability classification system. Most vulnerability databases and vulnerability scanners include a proprietary set of vulnerability categories. Vulnerability standardization would provide a platform for future databases and vulnerability scanners, as well as offer security administrators a common frame of reference when considering new and existing vulnerabilities that may affect specific network configurations.
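To make the n-gram definition in Section 2 concrete, the following is a minimal Python sketch of word n-gram extraction. It is an illustration only: the function name `word_ngrams` is hypothetical, and splitting on whitespace is an assumed tokenization, not the procedure used in this work.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of `text` as a list of tuples, in order of appearance."""
    words = text.lower().split()  # assumed tokenization: lower-case, split on whitespace
    # Slide a window of n words across the sequence; too-short input yields no n-grams.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


trigram = word_ngrams("ceramics collected by", n=3)
pairs = word_ngrams("allows remote attackers", n=2)
```

Here `trigram` holds the single 3-gram from the example in the text, and `pairs` holds the two overlapping bigrams of a three-word phrase.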

Section 3 - Methodology

3.1 Data Preprocessing

Our goal of identifying critical vulnerability features prompted us to explore and compare the various online vulnerability databases. Since we want to work with the descriptions of the different vulnerabilities, we sought out the data in a downloadable format. Of the vulnerability databases that we explored, only one offers a downloadable version of the data: the Common Vulnerabilities and Exposures Database [8]. At this point it is worth noting that CVE differentiates between two types of vulnerabilities: entry and candidate. Entries are accepted vulnerabilities, whereas candidates are currently under review. For our purposes, we consider only the entries, and not the candidates, and download the corresponding file in XML format.

Having the data in XML format allows us to process it using simple Unix sed commands. Figure 3.1 shows a sample of the downloaded file before processing. We remove all unnecessary data from the file, leaving only the text located between the <desc> and </desc> tags.

<item type="cve" name="cve " seq=" ">
  <status>entry</status>
  <desc>Microsoft Data Access Component Internet Publishing Provider and earlier allows remote attackers to bypass Security Zone restrictions via WebDAV requests.</desc>
  <refs>
    <ref source="ms" url="
    <ref source="ciac" url=" 074.shtml">L-074</ref>
    <ref source="xf" url="
  </refs>
</item>

Figure 3.1. CVE Entry in XML Format
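The extraction step above was carried out with sed. As an alternative illustration, here is a minimal Python sketch that keeps only the description text of accepted entries. The sample document, its `cve` root element, and the identifiers in it are made up for the example; the real download may differ in structure (for instance, it may use XML namespaces), so this is a sketch under those assumptions and not the processing actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; names and descriptions are invented for illustration.
SAMPLE = """<cve>
  <item type="CVE" name="CVE-0000-0001" seq="0000-0001">
    <status>Entry</status>
    <desc>Example product allows remote attackers to cause a denial of service.</desc>
  </item>
  <item type="CAN" name="CAN-0000-0002" seq="0000-0002">
    <status>Candidate</status>
    <desc>Example candidate description that should be skipped.</desc>
  </item>
</cve>"""


def entry_descriptions(xml_text):
    """Return the <desc> text of every <item> whose <status> is 'entry'."""
    root = ET.fromstring(xml_text)
    descriptions = []
    for item in root.iter("item"):
        status = (item.findtext("status") or "").strip().lower()
        desc = item.findtext("desc")
        if status == "entry" and desc:  # candidates are ignored, as in the text
            descriptions.append(desc.strip())
    return descriptions


descriptions = entry_descriptions(SAMPLE)
```

Using an XML parser instead of line-oriented text tools is a design choice for the sketch: it does not depend on how the tags are laid out across lines.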

Next we remove the punctuation, numbers, and stop words from the text. Numbers and punctuation are, again, removed using simple Unix commands. Stop words, also known as noise words, are removed by a Java program that we wrote. In short, the program stores the stop words in a data structure and then compares each word in the text to the words in that structure. If the word is found, it is discarded; if not, it is written to an output file. The complete list of stop words was obtained online at [9] and can be found in Appendix A. The resulting file is considered to be the text corpus, or corpus.

For the purpose of the experiment, we wish to compare two approaches, which we will refer to as the word vector and bigram vector approaches. For the word approach, we write a Java program that stores a list of all the words that occur in the corpus and the number of times each word occurs. Since the SOM algorithm does not perform well if more than 3000 vectors are presented to it, we employ a reduction technique based on the word counts to reduce the word space. We discard any word that occurs fewer than 3 times in the corpus and write each remaining word to an output file, one word per line. This technique reduces the number of words to the allowable limit; the resulting file is known as the word file.

For the bigram approach we begin by constructing a master list of all n-grams (with n=2) of words contained in the corpus. This is done using a Perl script written by Ted Pedersen, which can be found online at [10]. Figure 3.2 shows a sample of the output generated by this script.

allows<>remote<>
remote<>attackers<>
allows<>local<>
local<>users<>
denial<>service<>
execute<>arbitrary<>
cause<>denial<>

The output displays the n-gram component words separated by <>, followed by three numbers. The first number is the frequency count for the n-gram; the second number indicates how often the first word occurs on the left in any n-gram of the text. Similarly, the last number indicates how many times the second word occurs as the rightmost word in any n-gram in the text. For the purpose of this investigation, we do not require the second and third numbers. Therefore, to prepare the resulting file of n-grams for use, we remove the diamond characters and the two unused numbers, and limit the bigrams to be considered to ones that occur more than twice in the corpus. This is known as the bigram file.

Since the SOM toolkit we will be using requires that the data be in a vector format, we still need to generate such vectors before we are able to run the SOM algorithm on the data. To do this, we write a program to do the following:

- Determine the number of vulnerabilities, V, represented by the corpus
- Read the word file into a word array of length N
- Construct and initialize a 2 dimensional integer array of size V x N
- For each vulnerability, iterate through the corpus considering each word in sequence
- Compare the current word to the entries of the word array
- When a match is found, increment the integer at the corresponding index of the integer array
- After the entire corpus has been considered, write the integers to a file where each row of the 2 dimensional array is written to 1 line

The resulting file is the word vector file. The steps above are repeated for the bigram file to produce a bigram vector file. These vector files represent the frequency of occurrence of each word or bigram and are in the format required by the SOM toolkit.
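The preprocessing and vector construction steps above can be sketched in Python as follows. This is a minimal illustration, not the Java and Perl programs used in this work: the function names, the tiny stop word subset, and the three sample descriptions are all hypothetical. The default threshold keeps terms occurring at least 3 times, matching the text; the usage example lowers it to 2 only because the sample corpus is so small.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "and", "by", "in", "of", "the", "to", "via"}  # illustrative subset only


def tokenize(description):
    """Lower-case, drop numbers and punctuation, and remove stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    return [w for w in words if w not in STOP_WORDS]


def bigrams(tokens):
    """Join each pair of adjacent tokens into one bigram string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_vocabulary(docs, min_count=3):
    """Keep terms that occur at least `min_count` times across the whole corpus."""
    counts = Counter(term for doc in docs for term in doc)
    return sorted(term for term, count in counts.items() if count >= min_count)


def frequency_vectors(docs, vocabulary):
    """Build a V x N integer matrix: one row per vulnerability, one column per term."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        vectors.append(row)
    return vectors


sample_descriptions = [
    "Buffer overflow allows remote attackers to execute arbitrary code.",
    "Allows remote attackers to cause a denial of service.",
    "Buffer overflow allows local users to gain privileges.",
]
bigram_docs = [bigrams(tokenize(d)) for d in sample_descriptions]
vocabulary = build_vocabulary(bigram_docs, min_count=2)
vectors = frequency_vectors(bigram_docs, vocabulary)
```

The sketch uses a dictionary lookup in place of the linear comparison against the word array described above; the resulting counts are the same, and each row of `vectors` corresponds to one line of the vector file.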

3.2 SOM Toolkit

In the course of our testing, we first run the SOM algorithm on the bigram vectors using various map sizes and compare the quantization errors of the results to determine the smallest acceptable map size. We then run the SOM algorithm on the word vector file using the optimal map size as determined from the bigram tests. The SOM algorithm implemented by the SOM toolkit consists of four stages as described in [11]:

1. Initialization
2. Training
3. Quantization Error Evaluation
4. Visualization

In the initialization stage, random values are assigned to the reference vectors using a command such as:

randinit -din bigram.dat -cout bigrm.cod -xdim 6 -ydim 6 -topol hexa -neigh gaussian

The above command specifies the input and output files, the map dimensions, lattice type, and neighborhood function type. Map sizes considered are 6 x 6, 10 x 12, 14 x 14, and 18 x 12.

The second stage, map training, consists of two phases: the first phase orders the reference vectors, whereas the second phase fine-tunes the vector values. Both phases use a command structured as:

vsom -din bigram.dat -cin bigrm.cod -cout bigram.cod -rlen alpha 0.5 -radius 15

The rlen parameter determines the number of training iterations, or epochs. We consider several values of this parameter for each map size. The alpha value specifies the learning rate, whereas the radius determines the neighborhood radius. Since the first phase is meant to be the coarser of the two, we choose alpha to be 0.5 with a radius of 15. Fine-tuning these values for phase two, we choose an alpha value of 0.04 and a radius of 4. These values were suggested to me by other students who have previously conducted research using the SOM_PAK algorithms.

The third stage involves evaluating the quantization errors of the various maps by calculating the average error of the entire data sample in the original data file. This is done with the following command:

qerror -din bigram.dat -cin bigram.cod

For the final stage, visualization, we choose to consider the U-Matrix representations of the resulting maps. After comparing the bigram and word approaches, we choose the approach corresponding to the U-Matrix representation that yields distinguishable clusters. Since SOM_PAK does not label the nodes of the U-Matrix, we also use an SOM toolbox for MatLab [12] to generate a labeled, colorized version of the preferred U-Matrix. We evaluate the cluster labels, and propose a high-level categorization scheme for the representative vulnerabilities.

The map quantization errors and visualizations are considered in the following section. Conclusions and further applications of this work are considered in Section 5.
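To illustrate what the training and quantization error stages compute, here is a minimal pure-Python sketch of a single SOM update and of the average quantization error. It is not the SOM_PAK implementation: the function names are hypothetical, the map is assumed to be a rectangular grid (the experiments use a hexagonal lattice), and the learning rate and radius are held fixed for the one step shown, whereas a full training run would decrease them over time.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(codebook, sample):
    """Return the grid position whose reference vector is closest to `sample`."""
    return min(codebook, key=lambda pos: distance(codebook[pos], sample))


def train_step(codebook, sample, alpha, radius):
    """Pull every reference vector toward `sample`, weighted by a Gaussian
    of its grid distance to the best matching unit."""
    bmu = best_matching_unit(codebook, sample)
    for pos, ref in codebook.items():
        grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # equals 1.0 at the BMU itself
        codebook[pos] = [r + alpha * h * (s - r) for r, s in zip(ref, sample)]


def quantization_error(codebook, data):
    """Average distance from each data vector to its best matching unit."""
    total = sum(distance(codebook[best_matching_unit(codebook, x)], x) for x in data)
    return total / len(data)


# A 2 x 2 map of 2-dimensional reference vectors, keyed by (row, column).
codebook = {(r, c): [0.0, 0.0] for r in range(2) for c in range(2)}
codebook[(0, 0)] = [1.0, 1.0]
sample = [2.0, 2.0]

error_before = quantization_error(codebook, [sample])
train_step(codebook, sample, alpha=0.5, radius=1.0)
error_after = quantization_error(codebook, [sample])
```

With a learning rate of 0.5, the best matching unit moves halfway toward the sample, so `error_after` is half of `error_before` in this one-sample case. A lower quantization error means the map's reference vectors represent the data more closely, which is the basis for the comparisons in Section 4.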

Section 4 - Results

Table 4.1 below shows the quantization error values calculated from the first round of testing. These tests included initialization and training of 4 map sizes for the bigram approach.

Table 4.1. Bigram Vector Approach Quantization Error Values
Map Size    Number of Epochs    Quantization Error

Based on the above results for the bigram approach, we determine that a 6 x 6 map size is sufficient and suspect 1 million epochs is sufficient as well. For the second round of testing we run three tests using the word approach for 1 million, 3 million, and 5 million epochs. The results are summarized in Table 4.2.

Table 4.2. Word Vector Approach Quantization Error Values
Map Size    Number of Epochs    Quantization Error

The quantization error values in Tables 4.1 and 4.2 appear to plateau, which supports our hypothesis that 1 million epochs is a sufficient number of training iterations for this dataset. This is further demonstrated below in the U-Matrix representations for 1 million (Figure 4.1) and 5 million epochs (Figure 4.2). The two figures differ only subtly, which indicates that the values have plateaued. A comparison of U-Matrix diagrams for the n-gram and word approaches can be found in Appendix B.

Figure 4.1. U-Matrix for N-gram approach with 1 million epochs

Figure 4.2. U-Matrix for N-gram approach with 5 million epochs

By comparing the error values for the two approaches, we determine that the n-gram approach produced a more valuable result due to the lower quantization error values. Having declared the n-gram approach more valuable, we will use the resulting U-Matrix representation after labeling, as shown in Figure 4.3, to suggest high-level vulnerability categories. A larger version of Figure 4.3 can be found in Appendix C.

Figure 4.3. Labeled U-Matrix for N-gram approach with 5 million epochs

In the figure above, colours closer to the blue end of the spectrum indicate that the values are similar, whereas nodes coloured from the other end of the spectrum represent values that are less similar. With this in mind, we can identify four prominent node clusters. Examining the bigrams of these nodes, we look for words or bigrams that are the same or similar in meaning. The words or bigrams that occur repeatedly for the four clusters are: (1) "remote" and "quot"; (2) "remote attackers" and "denial service"; (3) "local" and "buffer overflow"; and (4) "arbitrary files".
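The colouring described above reflects how far each node's reference vector lies from those of its neighbours. The following Python sketch shows one simple way such values can be computed. It is an illustration under stated assumptions and not the SOM_PAK or MatLab toolbox implementation: the function name `u_matrix` is hypothetical, the grid is assumed rectangular with four neighbours per interior node (the maps in this work use a hexagonal lattice), and toolkit visualizations typically also draw cells between nodes.

```python
import math


def u_matrix(codebook, rows, cols):
    """For each node, return the mean distance between its reference vector
    and those of its immediate grid neighbours (up, down, left, right)."""

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if neighbours:  # a 1 x 1 map has no neighbours; leave its value at 0.0
                total = sum(dist(codebook[(r, c)], codebook[n]) for n in neighbours)
                heights[r][c] = total / len(neighbours)
    return heights


# A 1 x 3 map: the first two nodes are identical, the third is far away.
example_codebook = {(0, 0): [0.0], (0, 1): [0.0], (0, 2): [10.0]}
heights = u_matrix(example_codebook, rows=1, cols=3)
```

In this example `heights` is `[[0.0, 5.0, 10.0]]`: low values mark nodes that resemble their neighbours (the interior of a cluster), while high values mark the boundary between clusters.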

Section 5 - Conclusions and Future Work

In this work, the objective was to compare the results of applying the SOM algorithm to two different representations of a dataset for the purpose of identifying vulnerability features inherent in the data. To this end we identified a textual representation of software vulnerabilities and performed various preprocessing tasks until we were left with only a vulnerability description. Using this text, we constructed two different representations of the data, words and n-grams, and presented both to SOM, a data clustering algorithm.

We ran the SOM algorithm for both word and n-gram vector approaches and compared the resulting quantization error values to determine an acceptable map size and number of iterations. Through a process of trial and error, we determined that a map size of 6 x 6 nodes and 1 million iterations of training are sufficient for the dataset. We also compared the error values for the word and n-gram trials to conclude that the n-gram approach produced more valuable U-Matrix representations. After labeling the n-gram U-Matrix representation, we identified four high-level vulnerability categories.

Future work includes analysis of the defining features of the identified high-level categories, which will allow us to propose a standardization strategy for vulnerability classification. Such a classification scheme can be compared to other available categorization taxonomies, including those of online vulnerability databases and vulnerability scanning software. Since this research has many implications related to network vulnerabilities, another possibility for the future is to incorporate vulnerability classification factors based on this research into a novel or existing security metric. A standardized categorization scheme based on this information, as described above, could also be incorporated into existing vulnerability scanners or applied to existing vulnerability databases.

References

[1] SOM Toolbox: Intro to SOM by Teuvo Kohonen:
[2] Luo X., Zincir-Heywood A. N., A Comparison of SOM Based Document Categorization Systems, Proceedings of the IEEE International Joint Conference on Neural Networks, pp ,
[3] Kayacik H. G., Zincir-Heywood A. N., Using Self-Organizing Maps to Build an Attack Map for Forensic Analysis, Proceedings of the ACM International Conference on Privacy, Security, and Trust (PST 2006), pp ,
[4] Li Y., Venter H. S., and Eloff J. H. P., Categorizing Vulnerabilities using Data Clustering Techniques. Available online at:
[5] N-gram:
[6] Amine A., Elberrichi Z., Simonet M., and Malki M., Evaluation and Comparison of Concept Based and N-Grams Based Text Clustering Using SOM. Available online at:
[7] Berger H. and Merkl D., A Comparison of Support Vector Machines and Self-Organizing Maps for Categorization. Available online at:
[8] Common Vulnerabilities and Exposures (CVE).
[9] Stopwords.
[10] Ted Pedersen, Ngram Statistics Package (NSP).
[11]
[12] SOM Toolbox.

Appendix A: Stop Words

a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co computer con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put rather re same see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under until up upon us very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves

Appendix B: Unlabeled U-Matrix Representations

Figure B.1. 6x6 Bigram approach with 1 million epochs

Figure B.2. 6x6 Word approach with 1 million epochs

Appendix C: Labeled U-Matrix Representations

Figure C.1. 6x6 Bigram approach with 1 million epochs (Left)

Figure C.2. 6x6 Bigram approach with 1 million epochs (Right)
