Novel Approaches To The Classification Of High Entropy File Fragments


Novel Approaches To The Classification Of High Entropy File Fragments

Submitted in partial fulfilment of the requirements of Edinburgh Napier University for the Degree of Master of Science

May 2013

By Philip Penrose

School of Computing

Authorship Declaration

I, Philip Penrose, confirm that this dissertation and the work presented in it are my own achievement.

Where I have consulted the published work of others this is always clearly attributed; where I have quoted from the work of others the source is always given. With the exception of such quotations this dissertation is entirely my own work. I have acknowledged all main sources of help. If my research follows on from previous work or is part of a larger collaborative research project I have made clear exactly what was done by others and what I have contributed myself. I have read and understand the penalties associated with Academic Misconduct.

I also confirm that I have obtained informed consent from all people I have involved in the work in this dissertation following the School's ethical guidelines.

Signed:
Date:
Matriculation no:

Data Protection Declaration

Under the 1998 Data Protection Act, the University cannot disclose your grade to an unauthorised person. However, other students benefit from studying dissertations that have their grades attached. Please sign your name against one of the options below to state your preference.

The University may make this dissertation, with indicative grade, available to others.

The University may make this dissertation available to others, but the grade may not be disclosed.

The University may not make this dissertation available to others.

Abstract

In this thesis we propose novel approaches to the problem of classifying high entropy file fragments. We achieve 97% correct classification for encrypted fragments and 78% for compressed fragments. Although classification of file fragments is central to the science of Digital Forensics, high entropy types have been regarded as a problem. Roussev and Garfinkel [1] argue that existing methods will not work on high entropy fragments because they have no discernible patterns to exploit. We propose two methods that do not rely on such patterns. The NIST statistical test suite is used to detect randomness in 4KB fragments. These test results were analysed using Support Vector Machines, k-nearest-neighbour analysis and Artificial Neural Networks (ANN). We compare the performance of each of these analysis methods. Optimum results were obtained using an Artificial Neural Network, giving 94% and 74% correct classification rates for encrypted and compressed fragments respectively. We also use the compressibility of a fragment as a measure of its randomness. Correct classification was 76% and 70% for encrypted and compressed fragments respectively. Although this gave poorer results for encrypted fragments, we believe that the method has more potential for future work. We have used subsets of the publicly available Govdocs1 Million File Corpus so that any future research may make valid comparisons with the results obtained here.

Contents

Abstract

1 Introduction
  1.1 Context
  1.2 Background
  1.3 Problem
  1.4 Aims and Objectives
  1.5 Structure of remainder of the thesis

2 Literature Review
  2.1 Introduction
  2.2 File Fragment Identification
    2.2.1 Entropy
    2.2.2 Complexity and Kolmogorov Complexity
    2.2.3 Statistical Methods
    2.2.4 Linear Discriminant Analysis
    2.2.5 Multi-centroid Model
    2.2.6 Support Vector Machine
    2.2.7 Lempel-Ziv Complexity
    2.2.8 Specialised Approaches
    2.2.9 Genetic Programming
  2.3 High Entropy Fragment Classification
    2.3.1 Randomness
    2.3.2 Compressibility
  2.4 Conclusions

3 Design
  3.1 Introduction
  3.2 Building the Corpus
    3.2.1 Compression Methods
    3.2.2 Encryption Methods
    3.2.3 Corpus Creation
  3.3 Fragment Analysis Tools
    3.3.1 Testing Randomness - The NIST Statistical Test Suite
    3.3.2 Testing Compressibility
  3.4 Conclusions

4 Implementation and Results
  Statistical Analysis of Randomness
    k-NN
    SVM
  Compression
  Combined Results
  Fragment Size
    8KB Fragments
    Artificial Neural Network
  Analysis of Results
    NIST Results
    Compression Results
    4.5.3 A compressed fragment type classifier
  A Useable Classifier
  Limitations
    NIST Statistical Analysis Suite
    Compressed File Classification
    Compression Level
  Conclusions

5 Conclusion
  Overview
  Appraisal of Achievements
    Critical evaluation of the state of the art
    Devising approaches to classifying high entropy file fragments
    Implement approaches in a publicly verifiable way
  Future work

Appendix 1 - Failed Discriminators
Appendix 2 - Code Snippets
Appendix 3 - Project Planning Gantt Charts
Appendix 4 - Project Diary

List of Tables

Table 1 - Example Confusion Matrix
Table 2 - The NIST Statistical Test Suite
Table 3 - Type I and Type II Errors
Table 4 - Analysis of k-NN results classified by category
Table 5 - Analysis of k-NN results by fragment type
Table 6 - Analysis of SVM results classified by category
Table 7 - Analysis of SVM results by fragment type
Table 8 - Analysis of Fragment Compression classification by category
Table 9 - Analysis of Combined Classification results
Table 10 - k-NN results with 8KB fragment size
Table 11 - Weight Analysis of NIST statistical tests
Table 12 - k-NN results by category, no NIST Serial test
Table 13 - k-NN analysis by fragment type, no NIST Serial test
Table 14 - ANN analysis by category
Table 15 - ANN analysis by file type
Table 16 - k-NN compressed file misclassification
Table 17 - SVM compressed file misclassification
Table 18 - k-NN misclassified files
Table 19 - SVM misclassified files
Table 20 - Misclassified compressed fragments common to both classifiers
Table 21 - Misclassified compressed fragments common to both NIST and Compression classifiers
Table 22 - SVM analysis of compressed types
Table 23 - Number of files classified as each type
Table 24 - ANN analysis of results

List of Figures

Figure 1 - Zip file showing the first two bytes as the file type 'Magic Numbers'
Figure 2 - Optimal Separating Hyperplane
Figure 3 - Data not linearly separable
Figure 4 - Optimally separating hyper-surface
Figure 5 - Compressed file visualisation
Figure 6 - Encrypted file visualisation

Acknowledgements

Professor Bill Buchanan and Rich Macfarlane of Edinburgh Napier University have truly inspired me over the last two years. I could not have completed this work without unstinting support from my wife and daughter.

1 Introduction

1.1 Context

In [3], Garfinkel claims that much of the progress made in digital forensic tools over the last decade is becoming irrelevant. These tools were designed to help forensic examiners find evidence, usually on a relatively low capacity hard disk drive, and do not scale to the capacity of digital storage devices commonly available today. In addition, time may be a critical factor at the initial stage of a forensic investigation. Giordano [4] describes the importance in military and anti-terrorist activities of forensic examiners getting a quick overview of the contents of seized digital media, while Rogers and Goldman [5] relate how criminal investigations may depend on the quick analysis of digital evidence on-site. Current digital forensic techniques and tools are not suited to such scenarios. They are aimed mainly at post-crime analysis of digital media. In a time critical situation a forensic investigator needs a data analysis tool that can quickly give a summary of the storage device contents. This will inform the investigator's decision on whether the media deserves prioritisation for deeper analysis. Garfinkel [6] puts forward the hypothesis that the content of digital media can be predicted by identifying the content of a number of randomly chosen sectors. The hypothesis is justified by randomly sampling 2000 digital storage devices to create a forensic inventory of each. It was found that, in general, sampled data gave similar statistics to the media as a whole. Using this methodology to quickly produce a summary of storage device contents requires that data fragments be identified accurately.

Research in the area of fragment classification has advanced over the last few years, so that many standard file types can be identified accurately from a fragment. Methods of classification fall into three broad categories: direct comparison of byte frequency distributions, statistical analysis of byte frequency distributions, and specialised approaches which rely on knowledge of particular file formats. However, none of these methods has been successful in the classification of high entropy file fragments such as encrypted or compressed data.

1.2 Background

Historically, file type identification has been achieved either by using the file type extension, as in Microsoft operating systems, or by using a library of file signatures as used by the Unix file command [7]. Many file types have these signatures, called magic numbers, embedded within the file header. These tend to be 2-byte identifiers at the start of the file. For example, a compressed ZIP file will have 0x504B as the first two bytes (the ASCII codes for the characters PK, the initials of the developer). This is illustrated in Figure 1, which shows the output from a hex editor while viewing a zip file.

Figure 1 - Zip file showing the first two bytes as the file type 'Magic Numbers'

In addition, some file formats have characteristic byte sequences repeated within them. In a JPEG file, a 0x00 byte is appended to each occurrence of the byte 0xFF within the body of the file, giving a unique signature for this file type. It should be noted that since the JPEG format is so easily identified, it has been excluded as a compressed type for the remainder of this work.
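By way of illustration only (this is not part of the original study), a minimal Python sketch of header-based identification is shown below. The signature table is a small assumed subset of well-known headers, not the full library used by the Unix file command.

    # Minimal sketch of magic-number identification. The signature table is a
    # small assumed subset of well-known headers.
    SIGNATURES = {
        b"PK": "zip",            # 0x50 0x4B - the initials of the developer
        b"\xff\xd8\xff": "jpeg", # JPEG start-of-image marker
        b"%PDF": "pdf",
        b"\x1f\x8b": "gzip",
    }

    def identify_by_header(path: str) -> str:
        """Guess a file type from its first few bytes, or return 'unknown'."""
        with open(path, "rb") as f:
            header = f.read(8)
        for magic, file_type in SIGNATURES.items():
            if header.startswith(magic):
                return file_type
        return "unknown"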

However, these simple methods of file type recognition are not available to us. It is unlikely that the single file fragment being examined will include the file header. Similarly, the file fragments used are unlikely to be long enough to show any significant difference in signature byte frequency. Classifying a file fragment from digital media based solely on its content, rather than on metadata within the header, is a difficult problem. A number of approaches have been proposed which have displayed varying degrees of success. Many of these methods apply statistical analysis to the byte frequency distribution within the fragment to create a fingerprint or centroid which is held to be characteristic of each file type. An unknown fragment is then classified by its closeness, measured using some metric (a measure of the distance between objects in a vector space), to these centroids. Others use machine learning techniques. A training set and a test set are created from a corpus consisting of a collection of files of varying types. The results of the statistical analysis done on the training set are used by the machine learning algorithm to create a classifier. The test set can then be used as input to the classifier and its performance measured. If the results from the classifier are satisfactory then it may be used on unknown file fragments.

1.3 Problem

Recent research has found that although classification of file fragments of many common file types can be done with high accuracy, the classification of high entropy file fragments is difficult and to date accuracy has been poor [8], [9]. Indeed, Roussev [1] suggests that it is not possible using current statistical and machine learning techniques to differentiate such fragments. The entropy of a file fragment measures the amount of randomness or disorder in the data within it. Compressed and encrypted files have little pattern or order. A file is compressed by representing any repeating patterns of data with a shorter code. A well compressed file should have no apparent patterns remaining in it, otherwise it could be compressed further. An encrypted file should have no patterns, otherwise it would be vulnerable to cryptanalysis. Thus these types are classified as high entropy.
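To make the notion concrete, a short sketch (ours, not the implementation used in this thesis) of the Shannon entropy calculation commonly used to flag high entropy fragments follows; values approaching 8 bits per byte suggest compressed or encrypted content.

    import math
    from collections import Counter

    def shannon_entropy(fragment: bytes) -> float:
        """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
        n = len(fragment)
        counts = Counter(fragment)
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    if __name__ == "__main__":
        import os
        print(round(shannon_entropy(b"\x00" * 4096), 3))    # 0.0 - no disorder
        print(round(shannon_entropy(os.urandom(4096)), 3))  # close to 8.0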

Garfinkel [2] states that the classification of high entropy file fragments is in its infancy and needs further research, and suggests that future work should investigate methods to improve classification performance on such fragments. To date no such investigations have been done.

1.4 Aims and Objectives

The main aim of this thesis is to investigate and devise methods for the classification of high entropy file fragments in a forensic environment. The objectives are:

- Critically evaluate the state of the art in file fragment analysis.
- Devise approaches to the classification of high entropy file fragments.
- Implement these approaches in a manner which makes our results available for other researchers to validate, or to compare directly against the methods which we develop.
- Critically evaluate these approaches.

1.5 Structure of remainder of the thesis

In chapter 2 we take a critical look at related work. We find that many methods for fragment classification exist and have been successful for many fragment types. However, high entropy fragments remain difficult to classify. The review also points to the methodology which we develop for high entropy fragment classification. In chapter 3 we explain and justify our design and methodology. We describe how we built our training and testing corpora from a publicly available corpus. We describe the tools that we use to measure randomness and compressibility in our experiments. The experiments are described, and the results reported and critically analysed, in chapter 4. We conclude in chapter 5 and give ideas for future work.

2 Literature Review

2.1 Introduction

This chapter presents the current research into file fragment classification. There is a theme running through the research that results in classifying high entropy fragment types have been universally poor. Many investigations have simply excluded these fragment types from their corpora. Roussev [1, p.11] questions whether current machine learning or statistical techniques applied to file fragment classification can ever distinguish between these types, since such fragments have no discernible patterns to exploit. These findings lead us to develop our own approaches to the problem. It can be observed that there has been a trend towards specialised approaches for each file type. Analysis of byte frequency distributions has often proved insufficient to classify fragment types, and the unique characteristics of particular file formats have been used to increase recognition accuracy. It also becomes apparent that in many cases neither the digital corpora used nor the software developed is publicly available. It is therefore not possible to validate the research or make a direct comparison against the methods which we develop. In addition, many results have been derived from small sample sets and thus may not be universally applicable. These observations lead us to design our investigation in a manner which avoids such criticisms.

2.2 File Fragment Identification

The idea of examining the byte frequency distribution (BFD) of a file to identify its type was introduced by McDaniel [10]. The BFD is simply a count of the frequency of occurrence of each possible byte value (0-255), giving a 256 element vector. Several of these vectors are averaged by adding corresponding elements and dividing by the number of vectors to give an average vector, or centroid.
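A minimal sketch (our illustration, not code from [10]) of how a byte frequency distribution, a centroid over several training fragments, and a simple averaged-difference closeness measure can be computed:

    from collections import Counter

    def byte_frequency_distribution(fragment: bytes):
        """Normalised 256-element byte frequency distribution of a fragment."""
        counts = Counter(fragment)
        n = len(fragment)
        return [counts.get(b, 0) / n for b in range(256)]

    def centroid(fragments):
        """Average the BFDs of several fragments of one type to form a centroid."""
        bfds = [byte_frequency_distribution(f) for f in fragments]
        return [sum(col) / len(bfds) for col in zip(*bfds)]

    def average_difference(x, y):
        """Average absolute difference between two BFDs - a simple closeness measure."""
        return sum(abs(a - b) for a, b in zip(x, y)) / len(x)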

For each byte value the correlation between byte frequencies in the files is also recorded. These vectors were taken to be characteristic of the file type and termed the fingerprint for that type. The BFD of an unknown file is subtracted from the fingerprint and the differences averaged to give a measure of closeness. Another fingerprint was developed using byte frequency cross-correlation (BFC), which measured the average difference in byte pair frequencies. This was a specialised approach targeting certain file types - HTML files, for example, where the characters < and > occur in pairs regularly. We shall see that developing such specialised approaches to identify file fragments when byte frequencies of different types are similar is a common occurrence throughout the research corpus. Average classification accuracy is poor, with BFD at around 27% and BFC at around 46%. This result is actually poorer than it appears since whole files were used instead of file fragments, and so the file headers which contain magic numbers were included. When file headers were considered by themselves an accuracy of over 90% was reported, as would be expected. The corpus consisted of 120 test files, four of each of the 30 types considered. There is no indication as to the source of the test files and no encrypted files were included. They noted that the ZIP file format had a low assurance level and that other classification methods might be needed to improve the accuracy for this type.

Li [11] extended the work of [10] and developed the methodology that had been introduced by Wang [12] for a payload based intrusion detection system. They used the term n-gram to refer to a collection of n consecutive bytes from a byte stream and based their analysis on 1-grams. The terms fileprint and centroid were used interchangeably to refer to a pair of vectors: one contained byte frequencies averaged over a set of similar files from the sample, the other the variance of these frequencies. In addition, they created a multi-centroid approach by creating several such centroids for each file type, since files with the same file extension do not always have a distribution similar enough to be represented by a single model. The metric used to calculate the distance of an unknown sample from each centroid was that used in [12].

It was proposed originally for computational efficiency in the scenario of a high bandwidth networked environment, but this simplification meant that it was no longer suitable as a metric. It was termed a simplified version of the Mahalanobis distance and given as:

d(x, y) = Σi |xi - yi| / (σi + α)

where:
- xi and yi are the centroid and unknown sample byte frequencies respectively,
- σi is the centroid standard deviation for that byte value,
- α is a small positive value added to avoid possible division by zero.

To be classified as a metric, a function should satisfy the following conditions for all points x, y, z in the space (Copson, 1988, p. 21):
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)

In the simplified metric we can ignore the denominator since it is simply a scaling factor, leaving d(x, y) = Σi |xi - yi|. Consider the vectors x = (4, 7, 3) and y = (1, 1, 2). Using this measure, d(x, y) = (4 - 1) + (7 - 1) + (3 - 2) = 10. If z = (0, 0, 0) then d(x, z) = 14 and d(y, z) = 4, so that d(x, z) = d(x, y) + d(y, z). Thus their simplified metric does not meet the criteria for being a metric. Xing [14] describes how learning algorithms depend critically on a good metric being chosen. We find that many of the techniques considered below which use machine learning to create centroids give no justification for the metrics used.

Unfortunately only 8 file types were considered, and no encrypted or compressed files were used, although they noted that all compressed files may have a similar distribution. To create the test corpus, 100 files of each type were collected from the internet using a Google search. To create the sample set, the first 20, 200, 500 and 1000 bytes of each file were taken. As expected, since these truncated files all contain the file header, the results were good. The average classification accuracy was around 99% for the truncated files containing just the first 20 bytes of the file, i.e. the file header. As the file size increased, accuracy decreased. Accuracy was worst when the whole file, rather than a truncated segment, was used. This can be explained by the fact that with just the first 20 bytes of a file the method reduces to the magic numbers solution referred to previously. As the file size increases the influence of these magic numbers is diluted and hence accuracy decreases.

The method named Oscar was introduced by Karresand [15] using the same BFD vectors as [11] to create centroids. This was soon extended to increase the accuracy of JPEG detection by introducing the Rate of Change (RoC) of consecutive bytes [16]. The rate of change is at a maximum for pairs of bytes 0xFF followed by 0x00. As explained earlier, the frequency of this byte pair is a unique marker for JPEG files. They avoid the criticism made of [11] by using a weighted Euclidean metric,

d(x, y) = √( Σi wi (xi - yi)² )

to measure the distance between an unknown sample and the centroid.

If this distance was below a certain threshold then the sample was taken to be a fragment of that file type. There is no indication as to the source of their file corpus. 57 files were first padded with zeros to ensure that each file was a multiple of 4kB, to simulate an unfragmented hard disc with 4kB clusters. These 57 files were then concatenated to form one large 72MB file. The file was scanned for each file type separately, and each 4kB block was examined. For compressed (zip) files a fragment was marked as a hit even if it contained header information. The authors noted that compressed types were difficult to tell apart because of the random nature of their byte distributions.

2.2.1 Entropy

Hall [17] departed from the BFD approach by suggesting that the entropy and compressibility of file fragments be used as identifiers. They used the idea of a sliding window making n steps through the fragment. Compressibility and entropy values calculated at each step were saved as elements of the characteristic vectors, although the window contents were not actually compressed. The LZW algorithm was used and the number of changes made to the compression dictionary taken as an indication of compressibility. Centroids were calculated as usual by averaging element values in a training set. Two metrics were evaluated: the Pearson rank order correlation coefficient and a simple difference metric similar to that used by [10], d(x, y) = Σi |xi - yi|. This metric suffers from the same criticism as that of [11] and results were poor. Identification of compressed fragments was only 12% accurate and other results were not given.

It was noted that the method might be better at narrowing down possible file types than actually assigning a file type. The initial corpus was a set of files from the author's personal computer.

2.2.2 Complexity and Kolmogorov Complexity

Veenman [18] combined the BFD with the calculated entropy and Kolmogorov complexity of the fragment to classify file fragments. The Kolmogorov complexity is a measure of the information content of a string which makes use of substring order. A large corpus of 13 different file types was used, with 35,000 files in the training set and 70,000 in the test set. HTML and JPEG fragments had over 98% accuracy; however, compressed files had only 18% accuracy with over 80% false positives. Results were presented in a confusion matrix which made them easy to interpret. Confusion matrices are used by other researchers and will be used in presenting our results, so a short explanation is given here. A confusion matrix is a convenient method of conveying information about the actual types and the predicted types made by a classification system. In the confusion matrix in Table 1 it can be seen that JPEG had 126 true positives - fragments which were actually JPEG and labelled as such. The JPEG type had a total of 86 false negatives - fragments which were actually JPEG but misclassified as another type. It is obvious that this classifier confuses ZIP and JPEG types.

Table 1 - Example Confusion Matrix (actual type versus predicted type for the JPEG, BMP and ZIP classes)
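For illustration only (not the tooling used in this thesis), a confusion matrix can be accumulated directly from lists of actual and predicted labels; the labels and counts in the example below are made up.

    def confusion_matrix(actual, predicted, labels):
        """matrix[actual_type][predicted_type] = number of fragments."""
        matrix = {a: {p: 0 for p in labels} for a in labels}
        for a, p in zip(actual, predicted):
            matrix[a][p] += 1
        return matrix

    # Hypothetical labels for six fragments.
    actual = ["jpeg", "jpeg", "zip", "bmp", "zip", "jpeg"]
    predicted = ["jpeg", "zip", "zip", "bmp", "jpeg", "jpeg"]
    m = confusion_matrix(actual, predicted, ["jpeg", "bmp", "zip"])
    true_positives = m["jpeg"]["jpeg"]                           # actually JPEG, labelled JPEG
    false_negatives = sum(m["jpeg"].values()) - true_positives   # JPEG labelled as another type
    print(true_positives, false_negatives)                       # 2 1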

2.2.3 Statistical Methods

Another statistical approach was suggested by Erbacher [19]. Only statistics calculated from the BFD were used rather than the BFD itself. Their analysis used a sliding window of 1kB to examine each block of a complete file rather than file fragments. By using the sliding window, however, they could identify component data types within the containing file, e.g. a JPEG image within a PDF file. They claimed that five calculated statistics were sufficient to classify the seven data types in their corpus of five files of each type, but no results were presented. This idea was developed by Moody [20], where the corpus consisted of 25 files of each type examined, and no compressed or encrypted files were included. It was found that several data types could not be classified because of the similarity of their structure. These files were processed through a secondary pattern matching algorithm where unique identifying characteristics, such as the high rate of occurrence of < and > in HTML files, were used for identification. Here again we see the use of specialised functions for different file types.

2.2.4 Linear Discriminant Analysis

Calhoun [21] followed the approach of [18] by using linear discriminant analysis for classification but used a selection of different statistics. Linear discriminant analysis is used to develop a linear combination of these statistics by modifying the weight given to each so that the classification is optimised. A statistic that discriminates well between classes will be given a bigger weight than one that discriminates less well. They also included a number of tests that could be classified as specialised, such as their ASCII test, where the frequency of ASCII character codes 32 to 127 can be used to identify text files such as HTML or TXT. The authors noted that the data did not conform to the requirements of the Fisher linear discriminant - that data should come from a multivariate normal distribution with identical covariance matrices - and this may explain some sub-optimal results. Overall only four file types were included in the corpus, with 50 fragments of each type, and no compressed or encrypted file fragments were included.

Fragments were compared in a pairwise fashion. For example, JPEG fragments were tested against BMP fragments and the results of classification noted; JPEG was then tested against PDF, and so on. There was no attempt at multi-type classification. Testing in such a way gives less chance of misclassification and the results should be interpreted with this in mind. Extensive tables of results for accuracy are given, but again there is no data about false positives or true negatives. They noted that a modification to their methodology would be required to avoid the situation where the method fails and all fragments are classified as one type, which nevertheless gives high accuracy.

2.2.5 Multi-centroid Model

Ahmed [22] developed the methods of [11] but introduced some novel ideas. In creating their multi-centroid model they clustered files with similar BFDs regardless of type. This implements their assumptions:
1. Different file types may have similar byte frequency distributions.
2. Files of the same type may have different byte frequency distributions.

Within each such cluster, linear discriminant analysis was used to create a discriminant function for its fragment types. Cosine similarity was used as a metric and was shown to give better results than the simplified Mahalanobis distance. The cosine similarity is defined as the cosine of the angle between the centroid BFD vector, x, and the fragment BFD vector, y:

Similarity = cos(x, y) = (x · y) / (‖x‖ ‖y‖)

Since all byte frequencies are non-negative the dot product is non-negative, and therefore the cosine similarity lies in the closed interval [0, 1]. If the cosine similarity is 1 then the angle between the vectors is 0 and they are identical other than in magnitude. As the cosine similarity approaches 0, the vectors are increasingly dissimilar.
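A small sketch (ours, purely illustrative) of the cosine similarity calculation on two frequency vectors:

    import math

    def cosine_similarity(x, y):
        """Cosine of the angle between two non-zero frequency vectors; lies in
        [0, 1] when all components are non-negative."""
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine_similarity([1, 0, 2], [2, 0, 4]))  # ~1.0, same direction
    print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0, orthogonal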

An unknown fragment was first assigned to a cluster using cosine similarity. If all files in a cluster were of the same type then the fragment would be classified as that type. If not, then linear discriminant analysis was used to find the closest type match in the cluster. Ten different file types were used, although compressed and encrypted types were excluded, and 100 files of each type were included in the training set and the test set. Whole files rather than file fragments were used, and so header information was included, achieving 77% accuracy.

Q. Li [9] used the BFD only and took a novel approach using a Support Vector Machine (SVM) for data fragment classification. Since SVMs are used extensively in our experimentation, a brief background is given here.

2.2.6 Support Vector Machine

A Support Vector Machine (SVM) is a supervised learning model which is used for classification. In supervised learning the training data is a set of data points, each with an associated output type. The supervised learning algorithm will infer a classifier which will predict the correct output type for each input data point in the training set. The SVM algorithm plots each training data point in n-dimensional space and constructs an optimal hyperplane to separate the two classes. Figure 2 shows that there are many possible classifiers. The one which maximises the distance between the nearest data points and itself is the optimal hyperplane. A data point will be classified according to which side of the hyperplane it lies on.

Figure 2 - Optimal Separating Hyperplane: maximises the sum of the distances from the support vectors to the plane itself. Illustration based on an idea from [23]

In some cases the data points are not linearly separable in n dimensions. Ben-Hur [24] recommends the use of an SVM kernel, which in this instance maps the n-dimensional space into a higher dimensional feature space where the data may be separable.

Figure 3 - Data not linearly separable in n dimensions is mapped to a higher dimensional space by the kernel function. Illustration based on an idea from [23]

It is worth noting that the kernel mapping need not be linear, and so the feature space may allow a separating hyper-surface rather than a hyperplane.
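As an illustration of the technique (not the software used in this thesis), a sketch using scikit-learn's SVC with an RBF kernel; the feature vectors and labels below are invented placeholders rather than experimental data.

    # Sketch of SVM classification of fragment feature vectors using scikit-learn.
    from sklearn.svm import SVC

    X_train = [[0.91, 0.12], [0.88, 0.15], [0.99, 0.02], [0.98, 0.01]]
    y_train = [0, 0, 1, 1]           # 0 = compressed, 1 = encrypted (placeholder labels)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel maps the points into
    clf.fit(X_train, y_train)                       # a higher dimensional feature space

    print(clf.predict([[0.97, 0.03], [0.90, 0.14]]))  # e.g. [1 0]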

Figure 4 - Optimally separating hyper-surface

Li [9] used only four file types: JPEG, MP3, DLL and PDF. There were no compressed or encrypted file types. The inclusion of the PDF file type may have affected their results since it is a container type - it can embed a variety of other formats within itself, such as JPEG, Microsoft Office or ZIP. Thus it might be difficult to differentiate a fragment of a file labelled PDF from some of the other types. Their training corpus was 800 files of each type downloaded from the internet. Each file was split into 4096 byte fragments and the first and last fragments discarded to ensure that header data and any possible file padding were excluded. The test set was created by downloading a further 80 files of each type from the internet and selecting 200 fragments of each type. Accuracy for classification of the four file types was 81%.

By contrast to previous researchers, Axelsson [25] used the publicly available data set govdocs1 [26], making it easier for others to reproduce the results. The normalised compression distance (NCD) was used as a metric. NCD was introduced by [27].

Their idea was that two objects (not necessarily file fragments) are close if one can be compressed further by using the information from the other. If C(x) is the compressed length of fragment x and C(x, y) is the compressed length of fragment x concatenated with fragment y, then

NCD(x, y) = (C(x, y) - min(C(x), C(y))) / max(C(x), C(y))

The k-nearest-neighbour algorithm (k-NN) was used for classification. The k-NN approach to classification is simpler than the SVM. No classification function is computed. The training set is plotted in n-dimensional space. An unknown example is classified by being plotted and its k nearest neighbouring points from the training set determined. The unknown example is assigned to the class which is most common among these k nearest neighbours. The results were poor and average accuracy was 35%.

In [28], commercial off-the-shelf software was evaluated against several statistical fragment analysis methods. SVM and k-NN (using the cosine similarity metric) methods were used. The test corpus was created from the publicly available RealisticDC dataset [26] and consisted of over 36,000 files of 316 file types. The file types are not listed, so we do not know if any compressed or encrypted files were included. The true file types were taken to be those reported by Libmagic using the Linux file command, and so no account was taken of possible anti-forensic techniques such as file header modification, which may have introduced bias into the results. It was not reported whether the first fragment of each file, which would have contained header information, was included. The performance reported on file fragments was for the SVM only: 40% accuracy measured using the Macro-F1 measure [29] with a 4096 byte fragment size. There is no breakdown of the results by file type, so we cannot gauge whether some high accuracy file types are masking others with very low accuracy.
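Returning to the normalised compression distance defined above, a minimal sketch (ours, with zlib assumed as the compressor purely for illustration - any real compressor can be substituted for C):

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        """Normalised compression distance between two byte strings."""
        cx = len(zlib.compress(x, 9))
        cy = len(zlib.compress(y, 9))
        cxy = len(zlib.compress(x + y, 9))
        return (cxy - min(cx, cy)) / max(cx, cy)

    frag_a = b"abcdefgh" * 512            # two fragments sharing structure
    frag_b = b"abcdefgh" * 512
    frag_c = bytes(range(256)) * 16       # a fragment with different structure
    print(round(ncd(frag_a, frag_b), 3))  # small - the pair compresses well together
    print(round(ncd(frag_a, frag_c), 3))  # larger - little shared information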

2.2.7 Lempel-Ziv Complexity

Sportiello [30] tested a range of fragment features including the BFD, entropy, Lempel-Ziv complexity and some specialised classifiers, such as the distribution of ASCII character codes, which characterises text based file types, and Rate of Change, which we have seen is a good classifier for JPEG files. The corpus consisted of nine file types downloaded from the Internet, decomposed into 512 byte blocks for each file type. There is no indication whether file header blocks were included. No compressed or encrypted data was included. SVM was used but no multi-class classification was attempted. For each file type a separate SVM model was created to classify that fragment type against each of the other types individually. The experiment was actually run using a 4096 byte fragment size and no indication of how these fragments were created is given. There is no confusion matrix of the results and there is no mention of false negative results. The table of results is arranged by fragment type, feature and feature parameters (the C and γ for the SVM). These parameters vary by file type and it appears that only the results for the best parameter values for individual fragment types are given. It is therefore difficult to compare the results with others.

Fitzgerald [8] used the publicly available govdocs1 data set. They created 9,000 fragments of 512 bytes for each of 24 file types. The first and last fragment of each file was omitted. There were an equal number of fragments of each file type. They used an SVM with standard parameters. Both unigrams and bigrams (pairs of consecutive bytes), along with various statistical values, were used for the fileprint. They selected up to 4000 fragments for the data set and apportioned these approximately in the ratio 9:1 as training and test sets respectively for the SVM. There is no mention of whether the 24 file types were represented equally in the selection. An overall accuracy of 47.5% was achieved, but correct compressed file prediction averaged 21.8%. It was noted, as in [2], [9] and [31], that the classification of high entropy file fragments was challenging.

2.2.8 Specialised Approaches

It is argued by Roussev [1] that using BFD or statistical methods is too simplistic. They advocate using all the tools available for file fragment discrimination. If distinct markers are present within a fragment then these should be used. If a byte pattern suggests a particular file type then knowledge of that file format can be used to check the classification. Thus they use purpose-built functions for each file type. They suggest a variety of approaches. As well as the header recognition and characteristic byte sequences explained in the introductory section, they use frame recognition. Many multimedia formats use repeating frames. If the characteristic byte pattern for a frame marker is found then it can be checked whether another frame begins at the appropriate offset. If it does then it is likely that the fragment is of that media type. It should be noted that, unlike other methods, this may require previous or subsequent file fragments. If a fragment type cannot be classified, then they use Context Recognition, where these adjacent fragments are also analysed. Although the govdocs1 file corpus was used, it was supplemented by a variety of MP3 and other files that were specially developed. This does not fit well with their own views expressed in [26], where a strong case was made for the use of standardised digital corpora. A discriminator for Huffman coded files was developed; however, its true positive rate was 21%. The need for further research in this area was stressed in the paper.

2.2.9 Genetic Programming

A novel approach using Genetic Programming was tested by Kattan [32]. 120 examples of each of six file types were downloaded at random from the Internet and no compressed or encrypted files were included. Analysis was done on whole files rather than fragments and so file headers may have been included. Features were first extracted from the BFD using Principal Component Analysis (PCA) and passed to a multi-layer artificial neural network (ANN) to produce fileprints. PCA removes redundancy by reducing correlated features to a feature set of uncorrelated principal components which account for most of the structure in the data.

This removal of features is generally accompanied by a loss of information [33, p.562]. PCA was used here to reduce the number of inputs to the next stage - a multi-layer auto-associative neural network (ANN) which creates the fileprint. It is mentioned that file headers in themselves were not used, but they would be part of the whole file. A 3 layer ANN was used with these fileprints as a classifier for unknown file types. Only 30 files of each type were used for testing. Results were reported in a confusion matrix and averaged 98% true positives. It is not clear whether the PCA would have extracted file headers as the most prominent component of the test data; if so, the high detection rate would be explained.

This work was extended by Amirani [34] to include detection of file fragments. The original version using an ANN as the final stage classifier was compared with classification using an SVM. 200 files of each of the 6 file types were collected randomly from the internet. Half were used as the training set and half as the testing set. For the fragment analysis a random starting point was selected in each file and a fragment of 1000 or 1500 bytes was taken. Results showed that the SVM classifier gave better results than the ANN for file fragments of both 1000 and 1500 bytes, with extremely good results overall. It is puzzling that PDF detection gives 89% true positives with the 1500 byte fragments. The PDF format is a container format and might contain any of the other file types examined - doc, gif, htm, jpg and exe - as an embedded object. The random selection of 1500 bytes from within such a file could be misclassified as the embedded object type. The high detection rate for the PDF type itself means that this must have rarely happened. Perhaps it is an indication that the sample set of 100 files is too small, or perhaps the file header has an undue influence on the PCA.

2.3 High Entropy Fragment Classification

In [2, p. S22] it is noted that "The technique for discriminating encrypted data from compressed data is in its infancy and needs refinement".

This is supported by our observation that, in the literature considered so far, there has been little mention of the classification of high entropy types. Where compressed fragments have been included in the test corpus, results have been poor. There is no source that deals with the classification of encrypted or random fragments. This area of research has been recognised as difficult [2], [8], [9], [31]. Most results rely on patterns within the data. However, [1] argues that compressed and encrypted file types have no such patterns. If a compressed file has patterns then it could be compressed further. If an encrypted file has patterns then it would be vulnerable to cryptanalysis. We therefore need to investigate methods of fragment identification that do not rely on patterns within the data.

2.3.1 Randomness

In Chang [35], the output from a number of compression algorithms and compression programs was tested for randomness using the NIST Statistical Test Suite [36]. It was found that every compression method failed the NIST tests. In the testing of the candidate algorithms for AES it was expected that encrypted files should be computationally indistinguishable from a true random source [37]. Zhao [38] used 7 large (100MB) test files and compressed and encrypted them by different methods. The whole set of 188 NIST tests was run against each file. At this scale they achieved good discrimination of encrypted files. These observations lead us to our first hypothesis - that we can distinguish between compressed and encrypted fragments by testing for randomness.

2.3.2 Compressibility

Ziv [39] stated that a random sequence can only be considered such if it cannot be compressed significantly. Schneier [40, Ch. 10.7] noted that "Any file that cannot be compressed and is not already compressed is probably ciphertext". Mahoney [41] states that encrypted data cannot be compressed.

By contrast, compression algorithms always have to compromise between speed and compression [38, p. 5]. It is unlikely that a compressed fragment is optimally compressed, and therefore it can usually be compressed further. Our second hypothesis, therefore, is that we can differentiate between compressed and encrypted fragments by applying an efficient compression algorithm. A compressed file should compress more than an encrypted file.

2.4 Conclusions

In recent research we have seen that many file fragment types can be identified with high accuracy. However, the classification of high entropy file fragments has been found to be difficult and to date accuracy has been poor. No work has been done on encrypted file types. It has been suggested that it is not possible using current statistical and machine learning techniques to differentiate between high entropy file fragments. Techniques to do so are in their infancy and methods to improve classification performance need to be investigated. We have also seen a variety of approaches employed in classification. Support Vector Machines, Artificial Neural Networks, Genetic Programming and k-Nearest-Neighbour have all been used. However, if patterns within the fragment are not there to exploit, as in high entropy file fragments, then simply changing the classification method will not alter that fact. We therefore concluded that a new approach would be needed. We found that lack of compressibility and randomness are characteristics of high entropy file fragments. We shall therefore investigate the use of compression and the analysis of randomness in our research into the classification of these file types.

3 Design

3.1 Introduction

To test our hypotheses we need to create a test corpus with which to evaluate the classification methods that we develop to distinguish between encrypted and compressed file fragments. In this section we describe how we use publicly available corpora so that our experimental results are repeatable by others. We use randomisation so that we can avoid bias. We also describe how we devised our own methodology to test our hypotheses.

3.2 Building the Corpus

In the scientific method it is important that results be reproducible. An independent researcher should be able to repeat the experiment and achieve the same results. We have seen in our review of related work that this is not generally the case. Most research has been done with private or irreproducible corpora generated by random searches on the WWW. Garfinkel [26] argues that the use of standardised digital corpora not only allows researchers to validate each other's results, but also to build upon them. By reproducing the work they show that they have mastered the process involved and are then better able to advance the research. Such standardised corpora are now available. The Govdocs1 corpus contains nearly 1 million files [26, p. S6]. For research purposes a set of 10 subsets, each containing 1000 randomly chosen documents from the corpus, is available. Three of these subsets were chosen randomly. Subset 0 was used while developing our methodology, subset 4 was used for training our classifiers and subset 7 for testing. This avoids any bias introduced by including fragments used in development or training as part of the test corpus.

We need to create file fragments of representative compressed and encrypted types from these subsets. We can assume that multimedia types which use lossy compression have been classified by the techniques reviewed in the related work. We will therefore consider only lossless compression methods in the remainder of our research.

3.2.1 Compression Methods

There are a number of lossless compression methods. In order that our corpus is representative of compressed files in the wild we shall create it using four of the most common. Compressors can be categorised as either stream based, like zip, gzip and the predictive compressors based on prediction by partial matching (PPM), or block based, like bzip2, where a whole input block is analysed at once [27, p.7]. The commonly used Deflate compressed data format is defined in RFC 1951. It uses the LZ77 compression method followed by Huffman coding. It was originally designed by Phil Katz for the compression program PKZIP [43]. LZ77 achieves compression by replacing a repeated string in the data with a pointer to the previous occurrence within the data along with the length of the repeated string. This is followed by Huffman coding, which replaces common symbols within the compressed stream by short codes and less common symbols by longer codes. This method is used by the zip and gzip compressors. Although zip and gzip use similar methods, gzip is a compressor for single files whereas zip is an archiver. An archiver can compress multiple files into an archive and decompress single files from within the archive. Zip is a common format on Microsoft Windows platforms, but gzip is primarily a Unix/Linux compressor. We will use both zip and gzip as representative of common archivers and compressors in the creation of our corpus. Bzip2 is a block coding compressor which uses run length encoding (RLE) and the Burrows-Wheeler transform. The B-W transform does not itself compress. It uses a block sort method to transform the data so that the output has long runs of identical symbols which can be compressed efficiently by RLE.

The final output is again Huffman coded. Bzip2 is a file compressor rather than an archiver in that it compresses single files only. We will use bzip2 as representative of a block based Unix/Linux compressor. There are also several proprietary compression implementations that are commonly used. WinRAR is one such. Its own archiving format is proprietary but it is based on LZ and PPM compression. PPM is another stream based compression method which uses an adaptive data compression technique based on context modelling and prediction. PPM compressors use previous bytes (or bits) in the stream to predict the next byte (or bit). They are adaptive in that they adjust the compression algorithm automatically according to the data being compressed. The output is arithmetic rather than Huffman coded. Whereas Huffman coding is restricted to a whole number of bits per symbol, many modern data compressors use arithmetic coding, which is not restricted by this limitation [43, Ch. 3.2]. It can work with all the compression methods above. WinRAR is used in the creation of our corpus as an example of proprietary formats and arithmetic coding.

3.2.2 Encryption Methods

In Microsoft Windows operating systems AES has been the default file and BitLocker drive encryption method since Windows XP, with Triple DES available as an alternative [44]. AES is also used by popular open source encryption software such as AxCrypt and TrueCrypt. PGP, together with the open source GnuPG conforming to the OpenPGP standard in RFC 4880, is the most widely used cryptographic system [45]. It uses AES, Triple DES, Twofish and Blowfish. We will use AES, Triple DES and Twofish as representative encryption methods while creating our corpus.

3.2.3 Corpus Creation

Hard discs store data in sectors. Since 2011, all hard disk drive manufacturers have standardised on a 4KB sector size. By emulation, these drives are backward compatible with the older 512 byte sector drives [46]. In addition, regardless of sector size, operating systems store data in clusters, and the usual cluster size is 4KB [47]. We will therefore use 4096 bytes as our fragment size. Most file systems store files so that the beginning of a file is physically aligned with a sector boundary [2, p. S15]. To emulate randomly sampled disc sectors we will therefore assume that each file begins on a sector boundary and consists of 4KB blocks. The first sector of any file will contain header information which may be used to identify a fragment type. If the file does not fill the last sector then this sector may contain padding or undefined content. For this reason we exclude the first and last sector of any file from our corpus. Each file in the training corpus was compressed individually by each compression method and encrypted by each encryption method. This generated a total of 6,972 files. A fragment beginning on a 4K boundary, excluding the first and last fragments, was randomly chosen from each file. Files which were less than 12KB after compression or encryption were excluded, since it is not possible to select a random 4KB fragment from such files once the first and last 4KB fragments are excluded. This generated a total of 4,931 fragments in our training corpus. Exactly the same procedure was used on the testing corpus and this generated 4,533 fragments.
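A minimal sketch (our illustration, not the tooling used to build the corpus) of this fragment selection procedure; the treatment of a partially filled final block is simplified here.

    import os
    import random
    from typing import Optional

    FRAGMENT_SIZE = 4096   # 4KB, matching the sector/cluster size discussed above

    def random_middle_fragment(path: str) -> Optional[bytes]:
        """Return one randomly chosen 4KB-aligned fragment from a file,
        excluding the first and last 4KB blocks; None if the file is too
        small (under 12KB) to leave a middle block."""
        n_blocks = os.path.getsize(path) // FRAGMENT_SIZE
        if n_blocks < 3:                          # no block left between first and last
            return None
        block = random.randint(1, n_blocks - 2)   # skip block 0 and the final block
        with open(path, "rb") as f:
            f.seek(block * FRAGMENT_SIZE)
            return f.read(FRAGMENT_SIZE)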

3.3 Fragment Analysis Tools

In this section we consider the methods available to test our hypotheses:
1. We can distinguish between compressed and encrypted fragments by testing for randomness.
2. We can differentiate between compressed and encrypted fragments by applying an efficient compression algorithm. A compressed file should compress more than an encrypted file.

No published work has been done in this area, so we devise our own methodology to test these hypotheses.

3.3.1 Testing Randomness - The NIST Statistical Test Suite

It is important that the output of an encryption algorithm is random, otherwise it would be subject to cryptanalysis. The NIST Statistical Test Suite [36] was used in randomness testing of the AES candidate algorithms to test whether their output was truly random [37]. However, [35] used the NIST tests and found that compressed data tended to fail randomness tests. We will use this finding to create our classifier: a compressed fragment should display poorer randomness than an encrypted one. We have modified the NIST test suite so that it can operate on multiple files and output the results in the correct format for our k-NN and SVM analysis.

The NIST Statistical Test Suite consists of a set of 15 tests, summarised in Table 2. Note that all tests operate on binary data (bits), not bytes. We use n to denote the number of bits in a sequence.

Table 2 - The NIST Statistical Test Suite

- Frequency (Monobit): proportion of zeroes and ones. Recommended: n >= 100
- Frequency within a Block: splits the sequence into N blocks of size M and applies the monobit test to each block. Recommended: M >= 20, M > 0.01n, N < 100
- Runs Test: checks whether the total number of runs (sequences of identical bits) is consistent with random data. Recommended: n >= 100
- Block Runs Test: runs test on data split into blocks of length M. Recommended: for n < 6272, M = 8; for n < 750K, M = 128
- Binary Matrix Rank: checks for linear dependence between fixed length substrings. Recommended: n >= 38,912
- Discrete Fourier Test: detects any periodic features in the sequence. Recommended: n >= 1000
- Non-overlapping Template Matching: counts occurrences of a target string of length m in N blocks of M bits, skipping m bits when the pattern is found. Recommended: n >= 10^6
- Overlapping Template Matching: as the non-overlapping test but does not skip. Recommended: n >= 10^6
- Maurer's Universal Statistical Test: detects whether a sequence is significantly compressible. Recommended: n >= 387,840
- Linear Complexity: determines whether the complexity of the sequence is such that it can be considered random. Recommended: n >= 10^6
- Serial Test: checks for uniformity - every m-bit pattern should have the same chance of occurring. Recommended: m < floor(log2 n) - 2
- Approximate Entropy: compares the frequencies of all overlapping patterns of size m and m + 1 against those expected of a random sequence. Recommended: m < floor(log2 n) - 5
- Cumulative Sums: calculates the maximum distance from zero of a random walk (the cumulative sum with 0 represented by -1 and 1 by +1). Recommended: n >= 100
- Random Excursions: measures the deviation from the expected number of visits to certain states in a random walk. Recommended: n >= 10^6
- Random Excursions Variant: counts the number of times a given distance from the origin is visited in a random walk and detects deviations from that expected of a random sequence. Recommended: n >= 10^6

We will not use the Binary Matrix Rank test, Overlapping Template Matching, Maurer's Universal Statistical test, Linear Complexity or the Random Excursions tests: our bit sequences are not long enough to make the results of these tests statistically valid. However, since we are using 4096 bytes = 32,768 bits, each fragment allows us to use 64 binary sequences of 512 bits, which satisfy the size requirements for all other tests.

In our tests the null hypothesis, H0, is that our data is random. The alternative hypothesis, Ha, is that the sequence is non-random. We have set the significance level of our tests at 0.01. The significance level, α, is the probability of a Type I error. A Type I error in this instance is concluding that our sequence is not random when in fact it is. Thus our significance level of 0.01 means that there is a likelihood that one random fragment in every 100 will be misclassified as non-random.

A Type II error occurs if our sequence is non-random but the test indicates that the sequence is actually random. The probability of a Type II error is denoted β. The detection rate is defined as 1 - β. This is summarised in Table 3.

Table 3 - Type I and Type II Errors

    True situation                         Accept H0 (data is random)   Accept Ha (sequence is not random)
    Sequence is random (H0 is true)        No error                     Type I error
    Sequence is not random (Ha is true)    Type II error                No error

[36] suggests that the results of the tests are first analysed in terms of the number of sequences passing each test. We are using 64 sequences of 512 bits. For each sequence a P-value is calculated, and H0 is accepted if the P-value >= α. At our 0.01 level of significance it is expected that at least 60 of the 64 binary sequences making up the fragment will pass the test if the fragment is truly random. Secondly, if the fragment is random then the P-values calculated for the 64 sequences should be uniformly distributed. A P-value of the P-values is calculated and, if it exceeds the recommended threshold, the set of P-values is taken to be uniformly distributed.

We run a total of nine statistical tests from the test suite on each file fragment. Each test generates two values: the number of the 64 sub-sequences of 512 bits passing the test, and a uniformity value. Two of the tests report two results each, so we have a total of 11 pairs of values to form our characteristic vector for each fragment. Thus the characteristic vector for a file fragment is defined as the sequence (n1, u1, n2, u2, ..., n11, u11), where ni is the number of the 64 sequences passing test i and ui is the uniformity P-value for those 64 tests.
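To make the first analysis step concrete, a sketch (ours, not the modified NIST suite used in this work) of the Frequency (Monobit) test applied to the 64 sub-sequences of a single 4KB fragment:

    import math

    ALPHA = 0.01

    def monobit_p_value(bits):
        """P-value of the NIST Frequency (Monobit) test for a list of 0/1 bits."""
        n = len(bits)
        s = sum(1 if b else -1 for b in bits)
        return math.erfc((abs(s) / math.sqrt(n)) / math.sqrt(2))

    def fragment_monobit_summary(fragment: bytes):
        """Split a 4096-byte fragment into 64 sequences of 512 bits and return
        the number passing at ALPHA together with the list of P-values."""
        bits = [(byte >> i) & 1 for byte in fragment for i in range(8)]
        p_values = [monobit_p_value(bits[i:i + 512])
                    for i in range(0, len(bits), 512)]
        passes = sum(1 for p in p_values if p >= ALPHA)
        return passes, p_values

    if __name__ == "__main__":
        import os
        passes, _ = fragment_monobit_summary(os.urandom(4096))
        print(passes, "of 64 sequences passed")   # typically 63 or 64 for random data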

3.3.2 Testing Compressibility

The probability that an encrypted (random) fragment will losslessly compress even by a small amount is low. Consider a random fragment of n bits. There are 2^n possible different fragments. Let P be the probability that the fragment will compress by 4 bytes (32 bits) or less. If the fragment compresses by 32 bits or less then the fragment of size n must map to one of the fragments of size n-1, n-2, ..., n-32. There are only

2^(n-1) + 2^(n-2) + ... + 2^(n-32) = 2^n - 2^(n-32)

such fragments. Since the compression is lossless, the decompression must map back to a unique original fragment, so the mapping must be 1-1. Thus there are only this many fragments which can compress by 4 bytes or less, and so:

P = (2^n - 2^(n-32)) / 2^n = 1 - 2^(-32)

The probability that a random fragment compresses by 32 bits or less is therefore 1 - 2^(-32), which is, for practical purposes, indistinguishable from 1.

Also, the probability that a random fragment compresses by more than four bytes is therefore 1 - P(compresses by 4 bytes or less) = 2^(-32), which is for practical purposes equal to zero. Thus if a fragment compresses by more than 4 bytes we can assume, with high confidence, that it is not encrypted. We will use this fact to classify our fragments.

We need a compression algorithm which meets several requirements. Firstly, we need an algorithm which will compress more effectively than standard algorithms such as deflate. Secondly, deflate uses <length, distance> pairs and literals which are then Huffman coded. There is a chance that literal bytes will align on a byte boundary, so a bytewise compressor might see them; however, in Huffman coding the data is packed as bits and the three-bit block header will throw out this alignment. Also, literals are likely to become rare further into the stream, and the situation is worse with dynamic Huffman coding as codes can be nearly any length. A bitwise compression algorithm, however, is not constrained by a lack of byte alignment: it will be able to see repeating literals or <length, distance> pairs [48].

For these reasons we are going to use zpaq as our compressor. In addition to being a suitable bitwise compressor, it uses bit prediction: it maintains a set of context models which independently estimate probabilities for the next bit, and the probabilities are combined to make the prediction. We would expect that, by definition, it would not be possible to predict the next bit in a random fragment. In a fragment which has not been optimally compressed, however, there must be some remaining pattern, and hence predictability, otherwise it would already be optimally compressed. Zpaq should be able to detect this and give a more optimal compression. Our classifier in this instance will simply be the compressed size of the fragment.
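As an illustration of the compress-and-compare step, the sketch below uses the .NET DeflateStream class purely as a stand-in compressor; the project itself uses zpaq, and the threshold argument is a placeholder rather than a value measured in this work.

Imports System.IO
Imports System.IO.Compression

Module CompressibilityCheck
    ' Deflate the fragment in memory and return the compressed size in bytes.
    Public Function CompressedSize(fragment As Byte()) As Long
        Using output As New MemoryStream()
            Using deflate As New DeflateStream(output, CompressionMode.Compress, leaveOpen:=True)
                deflate.Write(fragment, 0, fragment.Length)
            End Using                       ' disposing the DeflateStream flushes the remaining bytes
            Return output.Length
        End Using
    End Function

    ' Classify a fragment by whether it compressed below a chosen size threshold.
    Public Function ClassifyBySize(fragment As Byte(), threshold As Long) As String
        If CompressedSize(fragment) < threshold Then
            Return "Compressed"             ' it shrank noticeably, so it was not random
        Else
            Return "Encrypted"              ' effectively incompressible, consistent with randomness
        End If
    End Function
End Module

In practice, as Chapter 4 shows, the container overhead added by off-the-shelf compressors swamps these small differences, which is what motivates the custom raw-size compressor suggested in the future work.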

3.4 Conclusions

In this chapter we have described and justified our methodology. We have shown how we created our training and test corpora of encrypted and compressed file fragments. They have been generated from publicly available corpora so that an independent researcher should be able to repeat the experiment and achieve the same results. The compression and encryption methods used in the creation of the corpus have been chosen to be representative of real world usage. We have excluded the possibility of file header or footer information biasing the results.

We have seen that compressed fragments may fail the NIST test suite for randomness; encrypted fragments should not. We will use the output from the tests to create our characteristic vector for each fragment. These vectors will be used in k-NN and SVM machine learning algorithms to first create our classifier and then test it. We have also shown that a file fragment whose content is random will not compress significantly, whereas a compressed fragment may compress further. We will use this to test our fragments for randomness.

4 Implementation and Results

4.1 Statistical Analysis of Randomness

The NIST test suite was modified to operate in batch mode on multiple files and to output results in a format suitable for use as our classification vectors. The modified suite was run against our training and our testing corpora separately. The characteristic vectors generated from the training corpus were used to create our classification models. Both k-NN and SVM analysis of the results was done. The models created were used to classify the test corpus.

4.1.1 k-NN

Initial testing found that k = 3 together with the Euclidean distance as a metric was optimum for the k-NN analysis. The tables below show the confusion matrices for the aggregated results and individual file types. Our training set, Corpus 4, was used to create the classification model. The model was then applied to our testing set, Corpus 7.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 4 Analysis of k-NN results showing number of files classified by category

Encrypted file fragments have been identified with high accuracy, and Type I errors are low. However, it is obvious from the Type II errors that 42% of compressed files are being identified as encrypted.
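For reference, a minimal sketch of the distance calculation and majority vote used by a k-NN classifier of this kind is shown below. The TrainingExample type and all names here are illustrative only; this is not the code used to produce the results above.

Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Class TrainingExample
    Public Property Vector As Double()
    Public Property Label As String       ' e.g. "Encrypted" or "Compressed"
End Class

Public Module Knn
    Private Function Euclidean(a As Double(), b As Double()) As Double
        Dim sum As Double = 0
        For i As Integer = 0 To a.Length - 1
            sum += (a(i) - b(i)) ^ 2
        Next
        Return Math.Sqrt(sum)
    End Function

    ' Classify a characteristic vector by majority vote among its k nearest training vectors.
    Public Function Classify(sample As Double(), training As IEnumerable(Of TrainingExample), Optional k As Integer = 3) As String
        Dim nearest = training.OrderBy(Function(t) Euclidean(sample, t.Vector)).Take(k)
        Dim vote = nearest.GroupBy(Function(t) t.Label).OrderByDescending(Function(g) g.Count()).First()
        Return vote.Key
    End Function
End Module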

An analysis of the classification by fragment type is shown in Table 5.

Actual Type    Predicted Type
        aes     3des    2fish   bz2     zip     rar     gz
aes     44.6%   27.4%   19.1%   3.8%    1.3%    2.2%    1.6%
3des    48.4%   26.2%   17.6%   3.7%    1.3%    0.7%    2.2%
2fish   44.2%   27.5%   20.3%   2.9%    1.2%    2.1%    1.8%
bz2     23.0%   16.5%   9.5%    27.2%   5.8%    4.5%    13.5%
zip     18.2%   13.1%   7.6%    20.5%   11.3%   9.4%    20.0%
rar     20.1%   14.2%   9.2%    12.9%   14.7%   11.7%   17.1%
gz      16.3%   13.1%   8.7%    17.7%   15.9%   10.8%   17.5%

Table 5 Analysis of k-NN results by fragment type

4.1.2 SVM

An SVM using LibSVM [49] and a radial kernel was trained on our training set, Corpus 4. 10-fold cross validation with stratified sampling was used to optimise the model parameters. In 10-fold cross validation the training set is partitioned into 10 subsets; the SVM is trained on 9 subsets and the remaining one is used to test the resulting model. Each of the ten subsets is used in turn as the testing set, and the parameters are chosen to minimise the error. In stratified sampling the subsets are chosen so that the proportion of file types in each subset reflects the overall proportion in the training set (a simple sketch of such a split is given below, after Table 6). The resulting classification model was used on our testing set, Corpus 7. The confusion matrix is shown in Table 6.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 6 Analysis of SVM results showing number of files classified by category

The encrypted fragment classification is better than, and the compressed fragment classification similar to, that of the k-NN analysis.
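The following sketch shows one simple way to produce stratified folds; it is an illustration of the idea only (the generic StratifiedFolds function and its names are ours) and not the cross-validation code actually used.

Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Module CrossValidation
    ' Deal the items of each class round-robin into the folds so that every fold
    ' keeps roughly the same class proportions as the full training set.
    Public Function StratifiedFolds(Of T)(items As IEnumerable(Of T), labelOf As Func(Of T, String), folds As Integer) As List(Of T)()
        Dim result(folds - 1) As List(Of T)
        For i As Integer = 0 To folds - 1
            result(i) = New List(Of T)
        Next
        For Each grp In items.GroupBy(labelOf)
            Dim idx As Integer = 0
            For Each item As T In grp
                result(idx Mod folds).Add(item)
                idx += 1
            Next
        Next
        Return result
    End Function
End Module

Each of the ten lists is then held out once as the validation set while the model is trained on the other nine.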

The fragments were analysed by fragment type to identify areas for improvement. The results are shown in Table 7.

Actual Type    Predicted Type
        aes     3des    2fish   bz2     zip     rar     gz
aes     95.9%   0.1%    0.3%    0.0%    0.0%    3.7%    0.0%
3des    97.0%   0.1%    0.4%    0.0%    0.0%    2.5%    0.0%
2fish   96.6%   0.3%    0.4%    0.0%    0.0%    2.8%    0.0%
bz2     50.0%   0.4%    0.2%    24.6%   12.6%   10.1%   2.2%
zip     37.6%   0.7%    0.0%    11.3%   27.2%   20.5%   2.7%
rar     43.3%   0.7%    0.4%    5.0%    23.0%   23.6%   4.0%
gz      36.5%   0.2%    0.4%    10.3%   26.9%   22.5%   3.4%

Table 7 Analysis of SVM results by fragment type

It is obvious that the largest error comes from the misclassification of compressed fragments as AES encrypted.

4.2 Compression

Analysis of the compression results was simpler: we have a single figure for each fragment, the size of the compressed fragment. Our hypothesis was that a compressed fragment will compress more than an encrypted fragment, so we classified the fragments by their compressed size. If it was smaller than a given threshold we classified the fragment as compressed; otherwise it was classified as encrypted.

Any compressed file is accompanied by additional information which may include file headers, archive filenames, file paths, dates, Huffman tables and dictionaries, depending on the compressor used. It quickly became apparent that this additional data, added to our 4KB fragment when we compressed it, totally masked the small changes in file size that we were trying to measure. It turned out that most fragments became larger after compression. This would not have mattered if the additionality was uniform; if it were, the small byte size changes that we expected if our hypothesis was true would still be apparent.

The confusion matrix below shows the results of classifying our test corpus fragments using compression. The classification of compressed files is better than with the NIST statistical tests. The correct detection rate is over 70%.

Actual Type    Predicted Type
            Encrypted   Compressed
Encrypted   76%         24%
Compressed  30%         70%

Table 8 Analysis of Fragment Compression classification by category

4.3 Combined Results

Since the detection rate for compressed fragments was higher using the compression test, we decided to merge the results to increase the overall compressed detection rate. We merged the results of the NIST test suite and the Fragment Compression test using the following rule:

If Compression prediction = Compressed Then
    Combined prediction = Compressed
Else
    Combined prediction = NIST prediction
End If

The results are shown below.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 9 Analysis of Combined Classification results

The detection rate for compressed fragments has increased markedly, as would be expected. However, a number of files classified as encrypted by the NIST test suite were classified as compressed by the compression test. The 530 files in the test corpus which were classified differently by the two methods account for 11% of the test corpus. Future work should concentrate on this area to produce an accurate classifier for encrypted and compressed types.

4.4 Fragment Size

We wished to test our methodology on different fragment sizes to evaluate the effect that fragment size had on classification accuracy. A smaller fragment size

would speed up analysis. However, we could not use a 512 byte fragment, since the NIST tests are not statistically valid at that fragment size. We decided instead to test whether a larger fragment would improve classification accuracy.

8KB Fragments

We ran the same tests using an 8KB fragment to see if there would be any improvement in classification accuracy. The results for the k-NN analysis are shown below.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 10 k-NN results with 8KB fragment size

It can be seen that there is a small improvement over the 4KB fragment size. The most notable change is that compressed fragment detection has risen from 58% to 67%. However, this test highlighted a problem: the NIST statistical tests do not scale linearly. Doubling the fragment size made the NIST test suite take over 2 hours, which is totally unacceptable. To investigate this we ran the NIST tests with a timer and disabled one test at a time. With the Serial test enabled, each fragment took, on average, 2.54 seconds to analyse. This is unacceptably long in our investigative forensic environment. With the Serial test disabled each fragment analysis took 0.05 seconds on average. These timings include the additional time needed for the timer and timer display code.

We decided to do a performance evaluation on our characteristic vector components to find out how much of a contribution each was making to the classification. Components which have little influence on the classification can be removed, and if the tests which produce these components are removed, this should help to increase the speed of the analysis.

We set up a weight optimisation system using a recursive artificial neural network. This allocates a weight to each component, or attribute, of the characteristic vector. The weight of each attribute reflects its overall effect on the classification. If an attribute has a very small weight then it can be removed from our analysis without a major effect on accuracy. We could not use a simple feed-forward neural network for attribute weighting because we could not fulfil the condition that all attributes are independent. We therefore decided to use an evolutionary genetic algorithm. The training data is fed to the artificial neural network (ANN) and weights are assigned to each attribute. The weighted attributes are used in cross validation with the training data and the results are fed back to the ANN, which produces a new set of weights. This continues until the weights converge, when there is little or no difference between generations. The results are shown in Table 11. The weights are shown for each of the two values that NIST produces for each test: the P uniformity measure and the number of bit streams passing each test.

NIST Result    Weight

Table 11 Weight Analysis of NIST statistical tests

The results show that the Serial test does have some significance but does not have a large influence on the overall classification. We ran the classification process on 8KB blocks again without the Serial test. It took under 4 minutes to analyse the 4,579 fragments in the training corpus and a similar time for the testing corpus. In a triage situation the SVM has already been trained, so the NIST analysis of the training set does not have to be repeated. This can be improved considerably in future work. Results are shown in Table 12.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 12 k-NN results by category, no NIST Serial test

The results are near identical to the 8KB test which used the Serial test, showing that its removal did not affect the results significantly. The classification by file type is given in Table 13.

Actual Type    Predicted Type
        aes     3des    2fish   bz2     zip     rar     gz
aes     45.6%   28.1%   19.3%   1.9%    1.4%    2.5%    1.2%
3des    43.3%   31.0%   20.0%   2.2%    0.4%    1.6%    1.5%
2fish   48.4%   27.3%   17.5%   2.3%    1.1%    1.4%    2.1%
bz2     22.9%   13.7%   7.4%    30.7%   8.4%    5.5%    11.5%
zip     11.6%   8.1%    5.8%    19.5%   16.8%   14.3%   23.9%
rar     15.5%   12.0%   10.8%   12.7%   12.0%   13.3%   23.7%
gz      10.6%   6.6%    4.8%    15.4%   15.4%   14.5%   32.6%

Table 13 k-NN analysis by fragment type, no NIST Serial test

It can be seen that misclassification of encrypted types as compressed is extremely low. Misclassification of compressed types is more uniformly distributed than in the 4KB fragment analysis. Most of the improvement in compressed fragment classification comes from better classification of the zip format fragments.

Artificial Neural Network

An ANN was set up with 500 training cycles, a learning rate of 0.3, momentum 0.2 and error epsilon 1.0E-5. The 8KB fragments were used with no Serial test. The compressed fragment classification was better than for the SVM.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 14 ANN analysis by category

An analysis by file type is given in Table 15.

Actual Type    Predicted Type
        aes     3des    2fish   bz2     zip     rar     gz
aes     1.6%    1.6%    90.1%   0.3%    0.0%    6.0%    0.3%
3des    3.3%    2.2%    89.0%   0.0%    0.0%    4.8%    0.7%
2fish   3.2%    1.9%    87.7%   0.1%    0.0%    5.6%    1.5%
bz2     1.0%    0.6%    38.7%   26.8%   6.8%    14.5%   11.5%
zip     0.4%    0.2%    17.8%   7.1%    15.3%   31.5%   27.8%
rar     1.0%    0.6%    23.3%   5.3%    8.4%    41.4%   20.0%
gz      0.8%    0.2%    14.3%   5.2%    16.2%   34.6%   28.8%

Table 15 ANN Analysis by file type

The encrypted files have mostly been classified as type 2fish, as have most of the misclassified compressed files. However, the overall correct category classification is high.

4.5 Analysis of Results

NIST Results

It is clear from the tables that the main problem with the classification system is the misclassification of compressed files. If we extract the values for compressed file misclassification from Tables 5 and 7 we see that classification is poor. For the bzip2 files, half were misclassified. This indicates that the classifier is not working well. It can be seen that the totals for misclassification are similar for each method.

Actual Type    Predicted Type
        aes     3des    2fish   Total
bz2     23.0%   16.5%   9.5%    49.0%
zip     18.2%   13.1%   7.6%    38.9%
rar     20.1%   14.2%   9.2%    43.5%
gz      16.3%   13.1%   8.7%    38.1%

Table 16 k-NN compressed file misclassification

Actual Type    Predicted Type
        aes     3des    2fish   Total
bz2     50.0%   0.4%    0.2%    50.0%
zip     37.6%   0.7%    0.0%    37.6%
rar     43.3%   0.7%    0.4%    43.3%
gz      36.5%   0.2%    0.4%    36.5%

Table 17 SVM compressed file misclassification

The total number of compressed files misclassified by each method of analysis is nearly identical.

Actual Type      Predicted Type: Encrypted    Compressed
Encrypted
Compressed

Table 18 k-NN Misclassified files

Actual Type      Predicted Type: Encrypted    Compressed
Encrypted
Compressed

Table 19 SVM Misclassified files

The ground-truth file types were used to check whether the same compressed fragments were being misclassified by each method. It was found that the intersection of the two sets contained 832 files. These were further analysed by file type, with the results shown in Table 20.

File Type    Number Misclassified as Encrypted    % of Type Misclassified
bz2          %
zip          %
rar          %
gz           %

Table 20 Misclassified compressed fragments common to both classifiers

Compression Results

We did the same analysis to calculate how many of the files misclassified by our compression classifier were common to those misclassified by the statistical

analysis. There were 452 files in common. The number was expected to be smaller because the compression test misclassified fewer of the compressed fragments. An analysis of these 452 fragments by original file type is given in Table 21.

File Type    Number Misclassified as Encrypted    % Misclassified
bz2          %
zip          %
rar          %
gz           %

Table 21 Misclassified compressed fragments common to both NIST and Compression classifiers

It is obvious that the classification of compressed fragment types is far better. However, it is noticeable that a large proportion of the fragments misclassified by the compression classifier belong to this common set.

A compressed fragment type classifier

The 832 compressed file fragments that were misclassified, which include the fragments misclassified by the compression method, were used as a training set for an SVM. It was thought that these difficult fragments might act as a good training set for compressed types. Obviously we could not use our test corpus, since these fragments came from that corpus originally, so we randomly chose subset 9 of the Govdocs corpus for testing. Fragments were created in exactly the same manner as described in Chapter 3, except that only compressed fragments were included in the test set. The results are shown in Table 22.

Actual Type    Predicted Type
        bz2     zip     rar     gz
bz2     56.1%   0.0%    28.3%   15.6%
zip     46.5%   0.5%    38.8%   14.2%
rar     55.2%   0.2%    34.3%   10.2%
gz      42.7%   0.0%    41.6%   15.7%

Table 22 SVM Analysis of compressed types

The predictions are heavily biased towards the bz2 type, and zip files are not being recognised. The number of files classified in each category is given in Table 23. Only 471 files out of 1789 were classified correctly.

Actual Type    Predicted Type
        bz2     zip     rar     gz
bz2
zip
rar
gz

Table 23 Number of files classified as each type

It is obvious from this that our difficult fragments do not make a good training set for a machine learning algorithm.

A Useable Classifier

Our experiment using an ANN classifier gave more acceptable results than those analysed above.

Actual Type      Predicted Type: Encrypted    Compressed    Accuracy
Encrypted        %
Compressed       %

Table 24 ANN Analysis of results

These results could be used in a real triage situation. If a quick analysis of a large capacity digital device were done, we believe that this level of accuracy would be of use to an investigator. Work is obviously needed to improve the compressed classification further, but in the meantime this sets a baseline for future research.

4.6 Limitations

There are some points in our current implementation that would benefit from further work.

4.6.1 NIST Statistical Analysis Suite

We have modified the NIST test suite so that it runs in batch mode. User interaction has been removed and replaced with parameters set within the software. The output has been amended so that it appends the results in a suitable format to a CSV file. However, the suite is called once for each file in our corpus, which means that it has to be loaded each time. Although the implementation is reasonably useable at the moment (under 4 minutes for 5000 file fragments), this could be considerably improved by modifying the NIST suite so that it iterates through the corpus without being invoked separately for each file fragment.

Compressed File Classification

The available compression software is not well suited to our needs. The software which we have used will, in most cases, add to the size of the file rather than compress it; this additionality was explained earlier. In addition, many modern compression programs have a built-in estimator for compressibility and do not compress data that appears unlikely to compress well; such blocks are simply stored without compression. We had to change our compression method at a late stage of the project because of this.

Compression Level

We have used the default settings with all our compression programs while creating our training and test corpora. Many programs allow users to choose between speed and degree of compression, and a user may choose an ultra-compression mode. It has yet to be investigated how this might affect our analysis.

4.7 Conclusions

We have classified encrypted fragments with good accuracy. Compressed fragment classification needs further research which is beyond the scope of this thesis. As expected, it was found that classification by compression gave better results for compressed fragments. In our research we used standard compression

software. We think that writing a custom compression application that reports the raw compressed size, without the inclusion of additional data needed for decompression or archiving, will yield better results. We have implemented our approaches in a manner which makes our results available for other researchers to validate. We have set a baseline in a new area of research against which others can compare methods and results.

5 Conclusion

5.1 Overview

In this thesis we have developed a methodology for the classification of high entropy file fragments. We have critically evaluated the current state of research in file fragment classification and found that classification of high entropy file fragments had not been attempted. This area of classification has been stated to be difficult [8], [9], [1], [2]. No attempt had been made at the classification of encrypted fragments, and where classification of compressed fragments has been attempted, results have been poor.

We have devised a methodology which classifies encrypted and compressed file fragments. A detection rate of 97% for encrypted fragments and 78% for compressed fragments has been achieved. These results have been achieved in an area where, to date, classification had been thought to be difficult. We have implemented these approaches in a manner which makes our results available for other researchers to validate or to compare directly against the methods which we develop.

5.2 Appraisal of Achievements

The aim of this thesis was to investigate and devise methods for the classification of high entropy file fragments in a forensic environment. We have achieved our aim. We have devised methods for the classification of compressed and encrypted file fragments. Unlike many of the other papers in the literature, we have developed our results using publicly available corpora so that results may be validated by others.

It is difficult to compare our results with those of others since no other work has been done in this precise area. Our aim was to be achieved by:

1. Critically evaluating the state of the art in file fragment analysis.
2. Devising approaches to the classification of high entropy file fragments.
3. Implementing these approaches in a manner which makes our results available for other researchers to validate or to compare directly against the methods which we develop.

We appraise our achievement of these objectives below.

Critical evaluation of the state of the art

The current state of the art was critically reviewed and reported on in Chapter 2. It was found that fragment classification is an ongoing area of research. We criticised the methodology reported in some papers, and the limited understanding of theory in others. By far the majority of research had been done on private or otherwise irreproducible corpora. This creates two major problems. Firstly, the research cannot be validated. Secondly, researchers who can replicate published results assure themselves that their understanding of the concepts is correct and can move their own research forward from there; since the research cannot be replicated, they cannot benefit from this progression. We have developed our methodology using publicly available corpora so that our results are replicable by others.

In the literature there were misunderstandings of basic concepts. One paper claimed to be able to identify a PDF fragment with high accuracy. A PDF file is a container file that may hold compressed data, text, images and so on. If a fragment from the file were from a JPEG image contained within the file then it

would be difficult to classify as PDF. Several papers used an unsuitable metric, seemingly because it had been used in a previous paper. Reporting of results was often poor, with no standardised format, and various interpretations of accuracy were used. It was seen that confusion matrices were the best method to display results, which motivated our use of them in this thesis.

We have thus achieved our objective of critically reviewing the state of the art in the subject area. We have conducted our research in such a way as to avoid the criticisms that we have made of others.

Devising approaches to classifying high entropy file fragments

We surveyed many disciplines in the search for methods to classify high entropy fragments. Our basic premise was that encrypted fragments were more complex than compressed fragments. We expected that compression methods balance compression effectiveness against speed and therefore leave some redundancy within the file. Our search for a method to discriminate between these was wide ranging. Standard techniques such as Fourier transforms were tried to detect any periodicity within fragments [50]. Descriptive statistics such as mean, variance, skewness and kurtosis of byte frequency distributions were tried but provided no discrimination; at this level there appeared to be no difference between the byte frequency distributions of the fragment types.

In the field of genetics we found the fractal dimension of a set being used as a discriminator. The fractal dimension is, among other things, a measure of set complexity [51] [52]. This was found to provide no discrimination between types, even when we calculated the set lacunarity, which acts as a discriminator between sets with a similar fractal dimension [53]. Other methods of measuring complexity were tried. LZ complexity is used in the analysis of biosignals and medical imaging [54] but provided no discrimination

when implemented. An approximation to Kolmogorov complexity [55] was tried as one of our standard statistical techniques but provided no discrimination. When viewing the histogram of the byte frequency distribution for different fragments it appeared that the compressed histograms were smoother. This led to an investigation of surface smoothness. A statistical test of smoothness of the set was implemented [56] but failed to act as a discriminator.

In investigating encryption methods we discovered two things:

1. Encrypted files must be truly random [37].
2. Random data will not compress significantly [39].

From this we implemented our statistical tests for randomness and compression. Whilst it is impossible to be comprehensive in the search for discriminators, in this thesis we examined a wide range of possibilities in a variety of disciplines to arrive at our methodology. We have thus achieved our objective of devising a methodology for discriminating between compressed and encrypted file fragments.

Implement approaches in a publicly verifiable way

Our use of publicly available corpora and careful explanation of methodology ensures that our work is verifiable. We have thus achieved our third objective.

5.3 Future work

We have not optimised the NIST statistical suite for speed. It is already acceptable on 4KB fragments, but we modified the suite so that it could be called in batch mode to run against each fragment. This means that each time it is called it has to be loaded into memory and run. If it were modified so that it loads once and iterates through our corpus then a considerable improvement in speed would result.

In theory we have shown that the compression of file fragments should lead to accurate classification of compressed and encrypted fragments. In practice, however, the small differences in compressed fragment size which we are looking for are masked by the additional data, such as filenames, paths, archive details and journaling entries, added by the compression software. In addition, many compression algorithms produce a dictionary or a Huffman table which is appended to the compressed data for the purposes of decompression. The file fragment which we are classifying has none of this additional data. The compression software which we have used is open source. In order to compare like with like, it should be investigated whether the source code for any of the compression programs can be modified to report the raw compressed size without the additional data. We need a program that embeds none of the data needed for decompression and appends no dictionaries, filenames or archive details. This may achieve results that are closer to those expected by theory.

In our review of the literature we saw that the trend in file fragment classification is toward specialised approaches. A specialised approach using knowledge of the file format could be used here to pre-process high entropy fragments for evidence of compression. Garfinkel [2] used bit shifting of fragments to detect evidence of the Huffman coding used in the Deflate algorithm. We propose that it should be possible to detect uncompressed (stored) data in a fragment of a compressed file by checking for a block header (BFINAL 0 or 1, BTYPE 00) and then checking that the next thirty-two bits comply with RFC 1951, LEN and NLEN being ones' complements of each other [43]. The probability of two zero bits followed by two 16-bit blocks which are ones' complements of each other is (1/2)^2 x (1/2)^16 = 2^-18, which makes a chance match highly improbable. It would therefore make a good identifier.

In a similar manner we thought that ZLIB headers might be detected. RFC 1950 defines a 2 byte header for a ZLIB block, and the 2 bytes are constrained to be a multiple of 31 when viewed as a 16-bit unsigned integer [57]. We thought that this might be investigated as a possible identifier. There are 2^16 = 65,536 possible 16-bit unsigned integers. Of those, 65,536 / 31, approximately 2114, are multiples of 31. Hence about 3% of random byte pairs will be a multiple of 31. We would therefore expect there to be 123 such byte pairs in each 4KB fragment, and so this would not be useful as identification of a ZLIB block header.
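A sketch of how such a pre-processing check might look is given below. This is our illustration of the proposal, not code from the project; a full detector would also honour the byte-boundary padding that RFC 1951 specifies before the LEN field, which the simplified scan below, following the description above, ignores.

Module DeflateHints
    ' DEFLATE packs bits least-significant-bit first within each byte.
    Private Function GetBit(data As Byte(), bitIndex As Integer) As Integer
        Return (data(bitIndex \ 8) >> (bitIndex Mod 8)) And 1
    End Function

    Private Function Read16(data As Byte(), bitIndex As Integer) As Integer
        Dim value As Integer = 0
        For i As Integer = 0 To 15
            value = value Or (GetBit(data, bitIndex + i) << i)
        Next
        Return value
    End Function

    ' True if, at bitIndex (taken to point at BTYPE, i.e. just after BFINAL), we see
    ' BTYPE = 00 followed by a 16-bit LEN whose ones' complement equals the next 16-bit NLEN.
    ' A chance match in random data has probability of roughly 2^-18.
    ' The caller must ensure at least 34 bits remain after bitIndex.
    Public Function LooksLikeStoredBlock(data As Byte(), bitIndex As Integer) As Boolean
        If GetBit(data, bitIndex) <> 0 OrElse GetBit(data, bitIndex + 1) <> 0 Then Return False
        Dim storedLen As Integer = Read16(data, bitIndex + 2)
        Dim storedNlen As Integer = Read16(data, bitIndex + 18)
        Return ((storedLen Xor storedNlen) And &HFFFF) = &HFFFF
    End Function

    ' Counts big-endian byte pairs divisible by 31 (the ZLIB header constraint from RFC 1950).
    ' In random data roughly 3% of pairs qualify, which is why this is not a useful signal on its own.
    Public Function CountZlibHeaderCandidates(data As Byte()) As Integer
        Dim count As Integer = 0
        For i As Integer = 0 To data.Length - 2
            Dim pair As Integer = (CInt(data(i)) << 8) Or data(i + 1)
            If pair Mod 31 = 0 Then count += 1
        Next
        Return count
    End Function
End Module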

These two classification methods could be built into the analysis, since the fragment bit stream is already being analysed. With such pre-processing used to classify and remove these compressed fragments, the overall classification should improve.

61 References [1] V. Roussev and S. L. Garfinkel, File Fragment Classification-The Case for Specialized Approaches, 2009 Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, pp. 3 14, May [2] S. L. Garfinkel, A. Nelson, D. White, and V. Roussev, Using purpose-built functions and block hashes to enable small block and sub-file forensics, Digital Investigation, vol. 7, pp. S13 S23, Aug [3] S. Garfinkel, Digital forensics research: The next 10 years, Digital Investigation, vol. 7, pp. S64 S73, Aug [4] J. Giordano and C. Macaig, Cyber Forensics : A Military Operations Perspective, International Journal of Digital Evidence, vol. 1, no. 2, pp. 1 13, [5] M. Rogers and J. Goldman, Computer forensics field triage process model, on digital forensics,, pp , [6] S. L. Garfinkel, Random Sampling with Sector Identification, Naval Postgraduate School Presentation, [7] R. Dhanalakshmi and C. Chellappan, File format identification and information extraction, 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), pp , [8] S. Fitzgerald, G. Mathews, C. Morris, and O. Zhulyn, Using NLP techniques for file fragment classification, Digital Investigation, vol. 9, pp. S44 S49, Aug [9] Q. Li, A. Ong, P. Suganthan, and V. Thing, A novel support vector machine approach to high entropy data fragment classification, Proceedings of the South African Information Security Multi-Conference, [10] M. McDaniel and M. Heydari, Content based file type detection algorithms, Proceedings of the 36th Annual Hawaii International Conference on System Sciences, 2003., [11] W. Li, K. Wang, S. Stolfo, and B. Herzog, Fileprints: Identifying file types by n-gram analysis, Proceedings of the 2005 IEEE Workshop on Information Assurance and Security, pp , P Penrose, MSc Advanced Security and Digital Forensics,

62 [12] K. Wang and S. J. Stolfo, Anomalous Payload-based Network Intrusion Detection, Recent Advances in Intrusion Detection, vol. 3224, pp , [13] E. T. Copson, Metric Spaces, Paperback. Cambridge, England: Cambridge University Press, [14] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance Metric Learning, with Application to Clustering with Side-Information, Learning (2003), vol. 15, no. 2, pp , [15] M. Karresand and N. Shahmehri, Oscar - File Type Identification of Binary Data in Disk Clusters and RAM Pages, Proceedings of the 2006 IEEE Workshop on Information Assurance, vol. 201, pp , [16] M. Karresand and N. Shahmehri, File type identification of data fragments by their binary structure, Information Assurance Workshop, pp , [17] G. Hall and W. P. Davis, Sliding window measurement for file type identification, IEEE Information Assurance Workshop, [18] C. J. Veenman, Statistical Disk Cluster Classification for File Carving, Third International Symposium on Information Assurance and Security, pp , Aug [19] R. F. Erbacher and J. Mulholland, Identification and Localization of Data Types within Large-Scale File Systems, Second International Workshop on Systematic Approaches to Digital Forensic Engineering SADFE07, 55-70, [20] S. J. Moody and R. F. Erbacher, SÁDI - Statistical Analysis for Data Type Identification, 2008 Third International Workshop on Systematic Approaches to Digital Forensic Engineering, pp , May [21] W. C. Calhoun and D. Coles, Predicting the types of file fragments, Digital Investigation, vol. 5, pp. S14 S20, Sep [22] I. Ahmed, K. Lhee, H. Shin, and M. Hong, On Improving the Accuracy and Performance of Content-Based File Type Identification, pp , [23] Optimal Separating Hyperplane, [Online]. Available: SVM. P Penrose, MSc Advanced Security and Digital Forensics,

63 [24] A. Ben-hur and J. Weston, A User s Guide to Support Vector Machines Preliminaries : Linear Classifiers, Methods in Molecular Biology, vol. Oliviero C, pp , [25] S. Axelsson, The Normalised Compression Distance as a file fragment classifier, Digital Investigation, vol. 7, pp. S24 S31, Aug [26] S. Garfinkel, P. Farrell, V. Roussev, and G. Dinolt, Bringing science to digital forensics with standardized forensic corpora, Digital Investigation, vol. 6, pp. S2 S11, Sep [27] R. Cilibrasi and P. Vitanyi, Clustering by Compression, IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1 28, [28] S. Gopal, Y. Yang, K. Salomatin, and J. Carbonell, Statistical Learning for File-Type Identification, th International Conference on Machine Learning and Applications and Workshops, no. DiiD, pp , Dec [29] A. Özgür, L. Özgür, and T. Güngör, Text Categorization with Class-Based and Corpus-Based Keyword Selection, Proceedings of the 20thInternational Conference on Computer and Information Sciences, pp , [30] L. Sportiello and S. Zanero, File Block Classification by Support Vector Machine, 2011 Sixth International Conference on Availability, Reliability and Security, pp , Aug [31] G. Conti, S. Bratus, A. Shubina, B. Sangster, R. Ragsdale, M. Supan, A. Lichtenberg, and R. Perez-Alemany, Automated mapping of large binary objects using primitive fragment type classification, Digital Investigation, vol. 7, pp. S3 S12, Aug [32] A. Kattan, E. Galván-López, R. Poli, and M. O Neill, GP-fileprints: File types detection using genetic programming, Genetic Programming, pp , [33] B. C. Geiger and G. Kubin, Relative Information Loss in the PCA, in Proceedings IEEE Information Theory Workshop, 2012, pp [34] M. Amirani, M. Toorani, and S. Mihandoost, Feature-based Type Identification of File Fragments, Security and Communication Networks, vol. 6, no. April 2012, pp , [35] W. Chang, B. Fang, X. Yun, S. Wang, X. Yu, and M. Ethodology, Randomness Testing of Compressed Data, Journal of Computing, vol. 2, no. 1, pp , P Penrose, MSc Advanced Security and Digital Forensics,

64 [36] A. Rukhin, J. Soto, and J. Nechvatal, A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, NIST Special Publication (2010), vol. 22, no. April, [37] J. Soto, Randomness testing of the AES candidate algorithms, NIST. Available via csrc. nist. gov, p. 14, [38] B. Zhao, Q. Liu, and X. Liu, Evaluation of Encrypted Data Identification Methods Based on Randomness Test, 2011 IEEE/ACM International Conference on Green Computing and Communications, pp , Aug [39] J. Ziv, Compression, tests for randomness and estimating the statistical model of an individual sequence, Sequences, [40] B. Schneier, Applied cryptography: protocols, algorithms, and source code in C, 2nd ed. New York: John Wiley & Sons. Inc., 1995, p [41] M. Mahoney, Data Compression Explained, Data Compression Explained - Dell Inc., [Online]. Available: [Accessed: 14-Jul-2012]. [42] E. E. Eiland and L. M. Liebrock, An Application of Information Theory to Intrusion Detection, Fourth IEEE International Workshop on Information Assurance, IWIA 2006., [43] P. Deutsch, RFC DEFLATE Compressed Data Format Specification version 1. 3 IESG, IETF, vol. RFC 1951, pp. 1 15, [44] Microsoft, Cryptography, Crypto API and CAPICOM, Windows Dev Centre, [Online]. Available: [45] D. Kumar, D. Kashyap, K. K. Mishra, and a. K. Misra, Security Vs cost: An issue of multi-objective optimization for choosing PGP algorithms, 2010 International Conference on Computer and Communication Technology (ICCCT), vol. 1, pp , Sep [46] The Advent of Advanced Format, International Disk Drive Equipment and Materials Association, [Online]. Available: [Accessed: 27-Aug-2013]. [47] M. Karresand and N. Shahmehr, Oscar Using Byte Pairs to Find File Type and Camera Make of Data Fragments, EC2ND 2006, P Penrose, MSc Advanced Security and Digital Forensics,

65 [48] Schnaader, Compression Problem, [Online]. Available: [Accessed: 04-Oct- 2012]. [49] C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1 27:27, [50] D. L. Donoho, M. Vetterli, R. A. Devore, I. Daubechies, and S. Member, Data Compression and Harmonic Analysis, vol. 44, no. 6, pp , [51] K. Sandau, A note on fractal sets and the measurement of fractal dimension, Physica A, vol. 233, pp. 1 18, [52] H. T. Teng, Progress In Electromagnetics Research, PIER 104, , 2010, Progress in Electromagnetics Research, pp , [53] S. Basu and E. Foufoula-Georgiou, Detection of nonlinearity and chaoticity in time series using the transportation distance function, vol. 301, no. September, pp , [54] M. Borowska, E. Oczeretko, A. Mazurek, A. Kitlas, and P. Kuc, Application of the Lempel-Ziv complexity measure to the analysis of biosignals and medical images, Roczniki Akademii Medycznej w Białymstoku, vol. 50, pp. 1 30, [55] A. N. Kolmogorov and V. A. Uspenskii, Algorithms and randomness, Theory of Probability and Its Applications, vol. 32, no. 3, pp , [56] G. N. Srinivasan and G. Shobha, Statistical Texture Analysis, PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, vol. 36, no. December, pp , [57] P. Deutsch and J.-L. Gailly, RFC ZLIB Compressed Data Format, IETF, vol. RFC 1950, pp. 1 10, [58] B. Li, Y. Li, and H. He, LZ Complexity Distance of DNA Sequences and Its Application in Phylogenetic Tree Reconstruction, Genomics Proteomics Bioinformatics, vol. 3, no. 4, [59] R. Begleiter, On Prediction Using Variable Order Markov Models, vol. 22, pp , [60] Y. Wu, J. P. Noonan, and S. Agaian, A novel information entropy based randomness test for image encryption, 2011 IEEE International Conference on Systems, Man, and Cybernetics, pp , Oct P Penrose, MSc Advanced Security and Digital Forensics,

66 [61] R. Lyda and J. Hamrock, Using Entropy Analysis to Find Encrypted and Packed Malware, IEEE Security and Privacy, vol. 5, no. 2, pp , [62] J. Goubault-larrecq, Jean Goubault-Larrecq and Julien Olivain Detecting Subverted Cryptographic Protocols by Entropy Checking, no. June, [63] K. Yim, T. Miller, and L. Faulkner, Chemical characterization via fluorescence spectral files and data compression by Fourier transformation, Analytical Chemistry, vol. 49, no. 13, pp , [64] S. W. Smith, The Scientist and Engineers Guide to Digital Signal Processing, New York: Springer-Verlag, [65] O. Alejandro, H. Reyna, T. Cryptographic, I. Analysis, and F. Computing, Cryptographic Implementations Analysis Toolkit, vol. 02, no. September, [66] M. A. L. I. Soliman and A. M. R. El-helw, Network Intrusion Detection System Using Bloom Filters. [67] S. L. Garfinkel, Automated Media Exploitation Research : History and Current Projects NPS is the Navy's Research University. Schools :, [68] C. Corbit and D. Garbary, Fractal Dimension as a quantitative measure of complexity in plant development, Proceedings: Biological Sciences, vol. 262, no. 1363, [69] D. Dubé and V. Beaudoin, Lossless Data Compression via Substring Enumeration, 2010 Data Compression Conference, pp , [70] T. Tao and A. Mukherjee, Pattern matching in LZW compressed files, Computers, IEEE Transactions on, vol. 54, no. 8, pp , [71] P. Farrell, S. L. Garfinkel, and D. White, Practical Applications of Bloom Filters to the NIST RDS and Hard Drive Triage, 2008 Annual Computer Security Applications Conference (ACSAC), pp , Dec [72] E. Casey, G. Fellows, M. Geiger, and G. Stellatos, The growing impact of full disk encryption on digital forensics, Digital Investigation, vol. 8, no. 2, pp , Nov [73] I. Ahmed, K. Lhee, H. Shin, and M. Hong, FAST CONTENT-BASED File type identification, Proceedings of the 25th ACM Symposium on Applied Computing, pp , P Penrose, MSc Advanced Security and Digital Forensics,

67 [74] I. Witten, R. Neal, and J. Cleary, Arithmetic coding for data compression, Communications of the ACM, vol. 30, no. 6, [75] A. Basile, Random versus Encrypted Data, [Online]. Available: [76] V. Koonaparaju, Statistical Test for Randomness, [Online]. Available: _report.pdf. [Accessed: 03-Oct-2012]. [77] a. Lempel and J. Ziv, On the Complexity of Finite Sequences, IEEE Transactions on Information Theory, vol. 22, no. 1, pp , Jan [78] X. Lin, C. Zhang, and T. Dule, On Achieving Encrypted File Recovery, Forensics in Telecommunications, Information and Multimedia, pp. 1 13, [79] M. Mahoney, The ZPAQ Open Standard Format for Highly Compressed Data - Level 1 Candidate, [80] G. Marsaglia and W. Tsang, Some difficult-to-pass tests of randomness, Journal of Statistical Software, pp. 1 9, [81] C. Monz, Machine Learning for Data Mining Support Vector Machines ( SVMs ) Separating Hyperplanes Boundaries SVM Hyperplanes. [Online]. Available: 4pp.pdf. [82] K. Nance, B. Hay, and M. Bishop, Digital forensics: defining a research agenda, System Sciences, HICSS 09., pp. 1 6, [83] J. C. Hernandez and P. Isasi, Finding Efficient Distinguishers for Cryptographic Mappings, with an Application to the Block Cipher TEA, Computational Intelligence, vol. 20, no. 3, pp , [84] B. Park, A. Savoldi, P. Gubian, J. Park, S. H. Lee, and S. Lee, Recovery of Damaged Compressed Files for Digital Forensic Purposes, 2008 International Conference on Multimedia and Ubiquitous Engineering (mue 2008), pp , [85] V. Roussev, Y. Chen, T. Bourg, and G. G. Richard, md5bloom: Forensic filesystem hashing revisited, Digital Investigation, vol. 3, pp , Sep P Penrose, MSc Advanced Security and Digital Forensics,

[86] S. W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, San Diego: California Technical Publishing.
[87] E. Thambiraja, G. Ramesh, and D. Umarani, A Survey on Various Most Common Encryption Techniques, International Journal, vol. 2, no. 7.
[88] B. Vanschoenwinkel and B. Manderick, Appropriate kernel functions for support vector machine learning with sequences of symbolic data, Methods in Machine Learning.
[89] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance metric learning with application to clustering with side-information, Learning, vol. 15, no. 2.
[90] A. Yazdanpanah and M. R. Hashemi, A new compression ratio prediction algorithm for hardware implementations of LZW data compression, CSI International Symposium on Computer Architecture and Digital Systems, Sep.
[91] G. P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), vol. 30, no. 4.
[92] J. M. Amigó, J. Szczepański, E. Wajnryb, and M. V. Sanchez-Vives, Estimating the entropy rate of spike trains via Lempel-Ziv complexity, Neural Computation, vol. 16, no. 4, Apr.

Appendix 1 Failed Discriminators

A variety of methods were tried before we arrived at a solution. The methods described here all failed to differentiate significantly between encrypted and compressed fragments. This may help future researchers avoid wasted effort. Software snippets of some of the code developments are given in a later appendix.

Initially we started with the usual statistical analysis of the byte frequency distribution (BFD). None of the standard descriptive statistics provided any differentiation. Statistics used included mean, variance, skewness and kurtosis. In addition, a truly random sequence should have a uniform distribution, with every byte value equally probable; if one particular byte value is more probable than others then the sequence is not random. The chi-square goodness of fit test was used to check the fit of the byte frequency distribution to a uniform distribution. Kuiper's test was also used to test the hypothesis that the BFD came from a uniform distribution.

The complexity of the sequence was tested using both LZ complexity [27], [58], [59] and an approximation to Kolmogorov complexity [55]. The serial correlation coefficient was calculated; this tests how each byte in the fragment depends on the previous byte, and the value should be close to zero for a random file. Another test was devised to check for predictability: a first order prediction model as used in PPM compression was coded. If the fragment is random, then the predictions of this model should be correct about 1/256 of the time.

Fragment entropy [9], [52], [60]-[62] was calculated but failed to give any discrimination at all. The Fast Fourier Transform [50], [63]-[65] was used to test for any periodicity in the data but none appeared.
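For reference, a minimal sketch of two of the simpler discriminators mentioned above, the chi-square goodness-of-fit of the BFD to a uniform distribution and the serial correlation coefficient, is given below. These are illustrative re-implementations, not the original project code.

Module FailedDiscriminators
    ' Chi-square statistic of the byte frequency distribution against a uniform
    ' distribution (expected count = fragment length / 256). Compare the result with
    ' the chi-square distribution on 255 degrees of freedom.
    Public Function ChiSquareUniform(fragment As Byte()) As Double
        Dim counts(255) As Integer
        For Each b As Byte In fragment
            counts(b) += 1
        Next
        Dim expected As Double = fragment.Length / 256.0
        Dim chi As Double = 0
        For i As Integer = 0 To 255
            chi += (counts(i) - expected) ^ 2 / expected
        Next
        Return chi
    End Function

    ' Serial correlation coefficient between each byte and its successor.
    ' Values close to zero are expected for random data.
    Public Function SerialCorrelation(fragment As Byte()) As Double
        Dim n As Integer = fragment.Length - 1
        Dim sumX, sumY, sumXY, sumX2, sumY2 As Double
        For i As Integer = 0 To n - 1
            Dim x As Double = fragment(i)
            Dim y As Double = fragment(i + 1)
            sumX += x : sumY += y
            sumXY += x * y
            sumX2 += x * x : sumY2 += y * y
        Next
        Dim numerator As Double = n * sumXY - sumX * sumY
        Dim denominator As Double = Math.Sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY))
        If denominator = 0 Then Return 0
        Return numerator / denominator
    End Function
End Module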

The normalised compression distance was tested [25]. We used the LZ complexity distance [58] and measured the distance of fragments from a random string generated from atmospheric noise at Random.Org. We hypothesised that the distance of a compressed string from the random string would be greater than that of an encrypted string. This turned out not to be true. We also applied the cosine similarity metric [8], [66], [67] to these strings with no better effect.

There seemed to be a visual difference between the BFDs of encrypted and compressed files when looking at their histograms.

Figure 5 - Compressed file visualisation
Figure 6 - Encrypted file visualisation

We investigated measures of surface smoothness and found that the fractal dimension of a set could be used as a discriminator for texture [51, p.222], [55]. In addition, fractal dimension can be used as a measure of complexity [68]. No discrimination was found between types. Lacunarity is a measure of the 'gappiness' of a set and can be used to discriminate between sets of similar fractal dimension. Calculation of lacunarity was implemented with no greater success.

In [2] bit shifting and auto-correlation were applied to fragments and the cosine similarity between bit shifted blocks was used to detect Huffman coding. This was implemented with no success. The algorithm given in [2] is not well specified and we implemented each possible interpretation of it.

A bijective compressor has the property that any possible sequence of bytes could be a possible output [41]. This means that any possible sequence can be decompressed. The sequence produced is likely to be meaningless if it was not originally produced by the compressor, but nevertheless the decompression will not fail. We decided to use a bijective decompressor on our fragments to see if there would be a significant difference between the decompressed size of encrypted and compressed fragments. We achieved no significant differentiation.

Appendix 2 Code Snippets

Wherever possible we used the .NET environment. VB.NET was used whenever possible; C#, C++ and C were used when necessary. The code snippet below was developed to calculate the fractal dimension of a set. The box counting algorithm used was developed from [51].

Function FractalDim(RA As Byte()) As Double
    Dim Points_X(4) As Double    ' Holds calculated data points
    Dim Points_Y(4) As Double
    Dim box_width, box_height, box_count, i, j, k, l As Integer
    Dim boxes_along, boxes_up, start_of_box, end_of_box As Integer
    Dim low_val, high_val As Integer

    For i = 8 To 4 Step -1                  ' Box widths 2^8 (256) down to 2^4 (16)
        box_width = 2 ^ i
        box_height = 2 ^ (i - 4)            ' Heights 2^4 (16) down to 2^0 (1)
        boxes_along = 4096 / box_width
        boxes_up = 256 / box_height
        box_count = 0
        For j = 1 To boxes_along
            start_of_box = (j - 1) * box_width
            end_of_box = start_of_box + box_width - 1
            For k = 1 To boxes_up
                low_val = (k - 1) * box_height
                high_val = low_val + box_height
                For l = start_of_box To end_of_box
                    If RA(l) >= low_val And RA(l) < high_val Then
                        box_count += 1      ' Point in box
                        Exit For
                    End If
                Next
            Next
        Next
        ' Got number of boxes that 'cover' the graph: box_count = N(s), with box_width as s.
        ' Store log N(s) and log(s); the fractal dimension D is the slope of the best
        ' line fit to the log(N(s)) / log(s) plot.
        Points_Y(Math.Abs(-8 + i)) = Math.Log(box_count)    ' Abs(-8 + i) gives 0, 1, 2, 3, 4
        Points_X(Math.Abs(-8 + i)) = Math.Log(box_width)
    Next

    ' Calculate gradient of best fit line of the log(N(s)) / log(s) plot using the Regression class.
    Dim R As New Regression()
    Dim regressionstats As New Regression.RegressionProcessInfo
    regressionstats = R.Regress(Points_X, Points_Y)
    Return Math.Abs(regressionstats.b)
End Function

The following code for calculating the LZ complexity of a byte array was developed from an algorithm in [54].

Private Function LZ_Complexity(S As String) As Double

73 Dim n, k, counter, m_upper, m_lower As Integer Dim gs(8192) As Integer 'ReDim gs(4096) 'Theoretically possible n eigenvectors Dim eigenvalue_found As Boolean Dim idx_list() As Integer 'idx_list() can potentially go from gs(1) value (=0) to n (=4096) Dim user_input As String Dim h_i(4096) As Integer 'Production history Dim gs_length As Integer = 0 ' number of eigenvectors 'First calculate eigenfunction = up to n eigenvalues n = Len(S) gs(1) = 0 gs(2) = 1 For n = 2 To Len(S) eigenvalue_found = False 'Create idx_list() using 1 basis for array - ignore 0 element ReDim idx_list(n - gs(n)) For counter = 1 To n - gs(n) idx_list(counter) = counter + gs(n) Next counter For k = 1 To Math.Ceiling((n - gs(n)) / 2) m_upper = idx_list(n - gs(n) - k + 1) If InStr(Strings.Left(S, n - 1), Mid(S, m_upper, n - m_upper + 1)) = 0 Then gs(n + 1) = m_upper eigenvalue_found = True gs_length += 1 Exit For End If m_lower = idx_list(k) If InStr(Strings.Left(S, n - 1), Mid(S, m_lower, n - m_lower + 1)) <> 0 Then gs(n + 1) = m_lower - 1 eigenvalue_found = True gs_length += 1 Exit For ElseIf m_upper = m_lower + 1 Then gs(n + 1) = m_lower eigenvalue_found = True gs_length += 1 Exit For End If Next k If Not eigenvalue_found Then user_input = MsgBox("Must be something wrong - could not find eigenvalue.", vbokonly, "Calculate LZ Complexity") End If Next n Dim gslength As Integer = gs.length ReDim Preserve gs(gs_length) Dim h_i_length As Integer = 1 P Penrose, MSc Advanced Security and Digital Forensics,

74 'Exhaustive history Dim h_prev As Integer = 0 k = 1 While k <> 0 And h_prev + k < 4096 k = find_h(gs, gs_length, h_prev) 'Finds index i, i > h_prev + 1, in gs() - where gs(i) > h_prev If k <> 0 Then h_i_length += 1 h_prev = h_prev + k h_i(h_i_length) = h_prev End If End While If h_i(h_i_length) <> Len(S) Then h_i_length += 1 h_i(h_i_length) = Len(S) End If Dim H As String = "" Dim start, H_length As Integer For k = 1 To h_i_length - 1 start = h_i(k) + 1 H_length = h_i(k + 1) - h_i(k) H = H & Mid(S, start, H_length) Next Return Len(H) End Function In [2] bit shifting and auto-correlation was done on fragments and the cosine similarity between bit shifted blocks was used to detect Huffman coding. This code was developed according to the algorithm. The algorithm given is not well specified and we implemented each possible interpretation of it. One version is given below. For Each FileItem As Object In FileList FileExt = FileItem.ToString.Substring(FileItem.ToString.Length - 3, 3) FileName = FileItem.ToString filecount += 1 FileChunk = My.Computer.FileSystem.ReadAllBytes(FileItem) Shift = FileChunk For ShiftBits = 1 To 8 'Rotate array one bit Shift = ShiftRight(Shift) 'Calculate number of set bits No1s = 0 For counter = 0 To 4095 Diff = Abs(Val(FileChunk(counter)) - Val(Shift(counter))) Difference(counter) = Diff No1s = No1s + OnBits(Difference(counter)) Next 'Calculate 0/1 ratio ZeroOneRatio(ShiftBits - 1) = ( No1s) / 'Calculate Cosine similarity AxB = 0 P Penrose, MSc Advanced Security and Digital Forensics,

75 Ai2 = 0 Bi2 = 0 For counter = 0 To 4095 AxB = AxB + Val(FileChunk(counter)) * Val(Difference(counter)) Ai2 = Ai2 + Val(FileChunk(counter)) * Val(FileChunk(counter)) Bi2 = Bi2 + Val(Difference(counter)) * Val(Difference(counter)) Next Similarity(ShiftBits - 1) = AxB / (Sqrt(Ai2) * Sqrt(Bi2)) Next ShiftBits 'Calculate mean and SD of zero / one ratio. Next End Sub Public Function ShiftLeft(ByVal RA() As Byte) As Byte() Dim Left(4095) As Byte Dim counter, carrybit As Integer carrybit = 0 For counter = 4095 To 0 Step -1 Left(counter) = (RA(counter) And 127) * 2 Or carrybit If RA(counter) And 128 Then carrybit = 1 Else carrybit = 0 End If Next Left(4095) = Left(4095) + carrybit ' First bit moved to end of array Return Left End Function Public Function ShiftRight(ByVal RA() As Byte) As Byte() Dim Rt(4095) As Byte Dim counter, carrybit As Integer carrybit = 0 For counter = 0 To 4095 Rt(counter) = (RA(counter) / 2) Or carrybit If RA(counter) And 1 Then carrybit = 128 Else carrybit = 0 End If Next Rt(0) = Rt(0) + carrybit ' First bit moved to end of array Return Rt End Function P Penrose, MSc Advanced Security and Digital Forensics,


Appendix 3 Project Planning Gantt Charts

[Gantt charts: Planned and Actual]

Appendix 4 Project Diary

EDINBURGH NAPIER UNIVERSITY
SCHOOL OF COMPUTING
PROJECT DIARY

Student: Phil Penrose
Supervisor: Rich Macfarlane / Prof Bill Buchanan
Date: 10/07/2012
Last diary date:

Objectives:
Need to tighten up the main aim and objectives of my proposal. Is the Kolmogorov complexity measure feasible, i.e. is my assertion that the byte distribution of encrypted files should be random a realistic one? Research needed. In the meantime the literature review can be started. After meeting with Rich, implement some ideas to tighten up the organisation of my work: install and use Mendeley Desktop with its Word extension, and install and use Dropbox for my dissertation files. Need to create a project plan - try to be realistic. Need to keep a project diary and try to keep it current. What should go in the project diary? Should I put in, e.g., "Working on literature review" as a week's entry, or should I list the papers and journal articles read that week?

Progress:
First diary entry (this one) created. Mendeley Desktop installed and used - wish I had known about this years ago; it would have saved many hours while doing coursework. Microsoft Project used to create the project plan. Dropbox installed and a shared folder created for the dissertation. Initial research seems to support my assertion that cipher text is random. NIST has recently (2010) published a comprehensive suite of randomness tests, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications". This test suite was used to evaluate submissions to the AES contest for randomness: "One of the criteria used to evaluate the AES candidate algorithms was their demonstrated suitability as random number generators. That is, the evaluation of their outputs utilizing statistical tests should not provide any means by which to computationally distinguish them from truly random sources." - Soto, J. (NIST)
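For illustration only (this sketch is not part of the project software and is not the NIST reference implementation): the simplest test in the NIST suite, the monobit frequency test, checks whether the numbers of zero and one bits in a fragment are close enough to equal for the data to be plausibly random. The sketch below replaces the suite's p-value calculation with the two-sided 1% critical value of the standard normal distribution (2.576).

Public Function PassesMonobitTest(ByVal fragment() As Byte) As Boolean
    'Count the set bits in the fragment
    Dim ones As Long = 0
    For Each b As Byte In fragment
        Dim v As Integer = b
        While v <> 0
            ones += (v And 1)
            v = v >> 1
        End While
    Next
    Dim nBits As Long = CLng(fragment.Length) * 8
    'Under randomness, (ones - zeros) / Sqrt(nBits) is approximately standard normal
    Dim s As Double = (2.0 * ones - nBits) / Math.Sqrt(nBits)
    Return Math.Abs(s) <= 2.576   'Two-sided test at the 1% significance level
End Function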

Supervisor's Comments:

EDINBURGH NAPIER UNIVERSITY
SCHOOL OF COMPUTING
PROJECT DIARY

Student: Phil Penrose
Supervisor: Rich Macfarlane / Prof Bill Buchanan
Date: Week Commencing 16/07/2012
Last diary date: 10/07/2012

Objectives:
Skype meeting with Rich on Tuesday 17/07/12. My lack of direction was discussed, and we agreed on work that could be carried on other than further research. Start the dissertation document, using the Word template for the moment, with paragraphs introducing the subject area and how it fits into the Digital Forensics landscape. Continue research, including related off-topic areas such as the paper on recognition of digital audio fragments for copyright purposes that I'm reading at the moment. Investigate the academic phrasebank. Email Imed re confirmation of submission dates and dates for module introductory lectures in Trimester 1, with a copy to Rich and Gordon Russell (new module leader). Continue with research in the sure and certain knowledge that a purpose will appear!

Progress:
Started research on Bloom filters. It soon became clear that Bloom filters and hash tables are of use in identifying known content. This would not be relevant to large disc triage, as I envisage large storage device sampling and file fragment identification.

The digital audio fragment recognition work was interesting, but again keyed to identifying known information on disc. Emails done. Still not happy with the administration at Napier - if not for Rich I'd be lost. The phrasebank is now in my favourites and looks like it will be very useful.

Supervisor's Comments:

Student: Phil Penrose
Supervisor: Rich Macfarlane
Date: Week Commencing 23/07/2012
Last diary date: 16/07/2012

Objectives:
No comments on the previous work, so I'll just carry on. My ideas have firmed up on the classification of file fragments: this is what is important when statistically sampling large-capacity devices - see Garfinkel. Will research file fragment classification in the coming week.

Progress:
There is actually a limited amount of research on this topic, although the feeling seems to be that it is important not only as a means of triage but also for file carving. Most methods seem to be based on analysis of the byte frequency distribution of the fragment. Analysis methods vary. Some use simple statistics such as the mean, variance, skewness and kurtosis (the mean together with the second, third and fourth central moments). Some use the byte frequencies themselves to define a vector which is supposed to characterise the fragment. Compressed files are rarely part of the test fragments - these seem difficult to classify - and encrypted files seem beyond the pale! Wasted a week learning C to grab sectors from disc; however, if I randomly sample disc sectors then I have no control over their content for testing my discrimination methods later on. Then realised that I can simulate the sectors, and keep control over the process, in .NET. Developed software to extract a random 4KB block from a file, excluding the first and last blocks so as to avoid headers at the start and packing at the end.
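The fragment extraction software itself is not reproduced here; a minimal sketch of the approach just described, assuming a 4KB block size and a file at least three blocks long, might look like this (names are illustrative, not the project code).

Public Function RandomInteriorBlock(ByVal path As String) As Byte()
    Const BlockSize As Integer = 4096
    Dim data() As Byte = System.IO.File.ReadAllBytes(path)
    Dim nBlocks As Integer = data.Length \ BlockSize
    If nBlocks < 3 Then Return Nothing   'Too small - no interior block to sample
    'Choose a block index between 1 and nBlocks - 2, excluding the first and last blocks
    Dim rnd As New Random()
    Dim blockIndex As Integer = rnd.Next(1, nBlocks - 1)
    Dim fragment(BlockSize - 1) As Byte
    Array.Copy(data, blockIndex * BlockSize, fragment, 0, BlockSize)
    Return fragment
End Function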

Student: Phil Penrose
Supervisor: Rich Macfarlane
Date: Week Commencing 13/08/2012
Last diary date: 23/07/2012

Objectives:
Start the literature review of file fragment classification more formally, using Mendeley as my referencing tool. Continue research into classification methods. Investigate statistical methods of classification. Create a test corpus for development of the software.

Progress:
Read and digested several papers. I like the way that they can be imported and viewed in Mendeley; the search facility is good as well, being able to search through all my saved papers. Reading these papers is something of an eye opener. Some test their methods using "corpora" obtained from a random search on Google for files of the correct type! Read a paper by Garfinkel et al. (Bringing science to digital forensics with standardized forensic corpora, 2009) which made the obvious case for using standardised corpora in research. The authors have created such corpora, which are publicly available, so I have decided to use them for testing any method that I develop. Apart from central moments, a number of statistical measures have been used, including Kolmogorov complexity (which on investigation is not computable!), LZ complexity, entropy and the normalised compression distance. Downloaded GovDocs1 Subset 0 for testing the software and the methodology of creating a test corpus.
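Of the measures listed, entropy is the most straightforward to compute. The following minimal sketch (illustrative, not the project code) returns the Shannon entropy of a fragment in bits per byte; compressed and encrypted fragments score close to the maximum of 8.

Public Function ShannonEntropy(ByVal fragment() As Byte) As Double
    'Byte frequency distribution
    Dim counts(255) As Integer
    For Each b As Byte In fragment
        counts(b) += 1
    Next
    'H = -Sum(p * log2(p)) over byte values that occur in the fragment
    Dim entropy As Double = 0
    For i As Integer = 0 To 255
        If counts(i) > 0 Then
            Dim p As Double = counts(i) / fragment.Length
            entropy -= p * Math.Log(p, 2)
        End If
    Next
    Return entropy   'Between 0 and 8 bits per byte
End Function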

Student: Phil Penrose
Supervisor: Rich Macfarlane
Date: Week Commencing 10/09/2012
Last diary date: 13/08/2012

Objectives:
Holidays intervene! I need to test my method of creating the corpus, and to test the statistical measures identified so far to see whether they can discriminate between compressed and encrypted file fragments.

Progress:
Used a selection of files from the GovDocs1 corpus. Compressed them using 7-Zip as gzip and bzip2 files, and encrypted them (using a Blowfish tool) with AES and 3DES. Tested the random fragment creation software - it works fine. Developed software to calculate the statistics above as well as entropy and a chi-square statistic, assuming a uniform byte distribution for a random or encrypted file fragment in the chi-square test. Created analysis tables in Excel. I cannot get better than a 60% detection rate with these statistics, so I need to look for different methods of classification. The above took up two weeks.
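For a 4096-byte fragment the expected count of each byte value under a uniform distribution is 16, and the resulting statistic is compared against a chi-square distribution with 255 degrees of freedom. A minimal sketch of the calculation described above (illustrative, not the project code) is:

Public Function ChiSquareUniform(ByVal fragment() As Byte) As Double
    'Observed byte frequencies
    Dim counts(255) As Integer
    For Each b As Byte In fragment
        counts(b) += 1
    Next
    'Expected count per byte value under a uniform distribution
    Dim expected As Double = fragment.Length / 256.0
    Dim chiSquare As Double = 0
    For i As Integer = 0 To 255
        chiSquare += (counts(i) - expected) * (counts(i) - expected) / expected
    Next
    'Caller compares the result against a chi-square critical value with 255 degrees of freedom
    Return chiSquare
End Function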

Student: Phil Penrose
Supervisor: Rich Macfarlane
Date: Week Commencing 24/09/2012
Last diary date: 10/09/2012

Objectives:
Continue with the literature review notes. Now that I have firmly plumped for classification of compressed and encrypted fragments I can start writing up the introduction; my only worry is that at the end I might just have negative findings to report. Continue research on possible methods of differentiating compressed from encrypted fragments.

Progress:
The notes for the literature review are progressing. I wonder sometimes how these papers get published! Some are excellent, but some make simple errors in either the design of the experiments or the analysis of the results. I will try to avoid this. Wrote up an introduction but am not happy with it; will review the advice on the dissertation web site on how to write a dissertation. Installed Python to run some software that I found for producing a visual representation of a file. When you look at the distribution it seems that encrypted files are "jaggier" or rougher than compressed ones, or vice versa. Investigate.
