ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU

A Project Presented to the faculty of the Department of Computer Science, California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science

by Kristofer Carlos Robles

SPRING 2017

© 2017 Kristofer Carlos Robles
ALL RIGHTS RESERVED

ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU

A Project by Kristofer Carlos Robles

Approved by:

__________________________________, Committee Chair
Dr. Pinar Muyan-Ozcelik

__________________________________
Date

Student: Kristofer Carlos Robles

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________________, Graduate Coordinator
Dr. Jinsong Ouyang

__________________________________
Date

Department of Computer Science

Abstract of ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU by Kristofer Carlos Robles

When file system metadata is corrupted, missing, or otherwise unreliable, file carving is the strategy used to recover files from a data volume. Difficulties arise when files within the data volume are stored in more than one fragment. These difficulties are compounded by the very large data volumes which are common today. This project shows how the methods used to address these difficulties can be greatly accelerated using parallel algorithms executed on a Graphics Processing Unit (GPU), which has a massively parallel architecture. The work of previous researchers on the same problem, largely conducted in the context of sequential algorithms and hardware, is used to guide the parallel implementation. Functions for pattern search, histogram, normalization, and statistical calculations are evaluated. With little to no optimization effort, speedups of 11X to 82X are achieved by parallel implementations over sequential implementations of these functions, which are used during file carving operations. Data sets containing fragmented files, authored specifically for evaluating file carving tools, are used to evaluate and compare implementations.

__________________________________, Committee Chair
Dr. Pinar Muyan-Ozcelik

__________________________________
Date

TABLE OF CONTENTS

List of Tables

Chapter
1. INTRODUCTION
2. BACKGROUND
   A. GPU Computing
   B. Digital Forensics
   C. File Carving
      1) Structure-Based Carving
      2) Content-Based Carving
      3) Other Carving Strategies
3. REVIEW OF RELATED WORK
4. CONCEPTUAL MODEL AND DESIGN
5. PROPERTIES OF DATA
6. EXPERIMENTAL RESULTS
   A. Pattern Search
   B. Histogram
   C. Normalization
   D. Correlation Coefficients, Standard Deviation, Mean
7. COMPARISON WITH RELATED WORK
8. CONCLUSION
9. FUTURE WORK
Appendix A. Hardware Configuration
References

LIST OF TABLES

1. Average search runtime using L0_Graphic.dd sample
2. Average histogram runtime using L0_Graphic.dd sample
3. Average normalization runtime using L0_Graphic.dd sample
4. Average implementation runtime using L0_Graphic.dd sample

I. INTRODUCTION

Digital forensic investigation has grown in importance as the global population continues to become more technology dependent and interconnected. These days it is difficult for any type of crime to occur without leaving at least some trace of digital evidence such as geolocation, communications, images, videos, and more. This information can be used to confirm or deny alibis and associations, or in some cases to directly show culpability. In the world of computer crimes, sometimes the digital evidence is the only evidence available to investigators. Many of the same digital forensic tools and methods developed for use in criminal investigations are seeing daily use in data recovery, computer intrusion investigations, and computer attack attribution.

The forensic examination of digital data volumes requires understanding what the volume contains, for example file names, file types, file history metadata, and the files themselves. When the file system metadata, such as the File Allocation Table (FAT) in a FAT32-formatted volume, is corrupted, missing, or otherwise unreliable, the process of retrieving this information is referred to as file carving. File carving is problematic, and it becomes very difficult when files are stored in multiple fragments within the data volume. While some implementations have been successful at achieving accurate file recovery or fast execution speed, it is difficult to achieve both simultaneously.

This project aims to utilize a GPU, an often-underused resource of computational power in non-graphics applications, to address these problems. Consumer-grade GPU hardware uses massively parallel architectures and is relatively cheap compared to other options of similar computational power. This project splits the computational load of file carving between a CPU and a GPU to show that the functions used in file carvers can be greatly accelerated when executed as parallel algorithms on a GPU. The specific hardware configuration is listed in Appendix A. The hardware was chosen to demonstrate what can be expected from a low-power and low-cost implementation, as many law enforcement agencies and information security teams may be working within tight budget constraints.

II. BACKGROUND

A. GPU Computing

This project shows how the use of parallel algorithms developed to run on a consumer-grade GPU can greatly accelerate the functions used in file carving operations. Although it is common for games and graphics programs to make great use of GPU hardware, it is an often-underutilized source of additional compute power for non-graphics applications. GPU technology continues to push forward as higher display resolutions and virtual reality become mainstream in the marketplace. Common hardware such as the NVIDIA GTX 1080 used in this project contains 2560 compute cores and a large amount of dedicated memory. Manufacturers such as NVIDIA and organizations such as the Heterogeneous System Architecture (HSA) Foundation have also been improving the software interfaces to GPU hardware, making utilization easier than ever for application programmers. This project uses NVIDIA's Compute Unified Device Architecture (CUDA) platform [18], which will feel familiar to programmers who have used C++.

GPUs were designed for highly parallel graphical processing tasks, and ever since NVIDIA released the world's first GPU in 1999 they have always included large numbers of relatively low-power processing units working in parallel [19]. Researchers noticed these resources and began using them to solve non-graphical problems even before the hardware producers provided a convenient application programming interface (API), which may have driven GPU makers to make more of the hardware components of GPUs programmable over time. This early work likely helped to drive interest in and development of new applications of GPU hardware and improved programming interfaces.
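For illustration, a minimal CUDA example follows. It is a sketch written for this overview rather than code from the project, and every name in it is arbitrary: a kernel is ordinary C++ with a __global__ qualifier, and the <<<blocks, threads>>> syntax launches it across thousands of threads at once.

#include <cuda_runtime.h>

// Minimal CUDA example (illustrative only): each thread doubles one element.
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));            // allocate device memory
    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover n
    doubleElements<<<blocks, threads>>>(d_data, n);  // launch n parallel threads
    cudaDeviceSynchronize();                         // wait for the kernel
    cudaFree(d_data);
    return 0;
}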

The market demand for specialized GPU computing hardware is high enough today that NVIDIA has taken a modular approach, building some GPU configurations specifically for compute-intensive applications and others for graphics applications using the same underlying architecture components [20].

B. Digital Forensics

Given the challenges faced by digital forensics investigators today, it is natural to ask if this underutilized and economical resource could be applied. The background material on digital forensics history, process, and the state of the art has been covered extensively and will not be reintroduced here [2][14]. This project focuses on the second stage of a forensic inquiry, analysis, which can be one of the more computationally intense stages.

During the first stage of a typical forensic examination, a copy of the digital data volume being examined is made so it can be analyzed without risk of modifying the original. In most literature this copy of data is referred to as the image or search image, as it is often collected using the same techniques used to image a hard disk drive (HDD). Once the copy is acquired and validated against the original volume using a cryptographic hash, the second stage, analysis, begins.

The first task during analysis is to understand what the image contains. In the context of a consumer PC, analysis aims to retrieve file names, types, history metadata, and the files themselves in a machine-readable format. In the best-case scenario the file system metadata of the image is available and these aims can be achieved with accuracy and confidence. When the file system metadata is corrupted, missing, or suspected to be unreliable due to tampering, achieving this understanding of the image contents becomes much more complicated. In such a scenario file carving is used to analyze the image.

C. File Carving

File carving attempts to recover the files from a search image without the use of file system metadata. Under these conditions the best chance of successfully recovering a given file exists when the file is stored in one contiguous segment within the image. When a file is stored as two or more segments within the image, a condition known as fragmentation, successful recovery becomes much more difficult. Many file carving implementations struggle to recover fragmented files accurately or do not scale well enough to be practical given the enormous data volumes common today.

Of primary concern in both traditional criminal investigations and computer emergency response are the accuracy of a given forensic method and its speed of execution. Inaccurate or untimely evidence has little to no value in these pursuits. Previous research has shown that these two attributes are deeply intertwined in file carving implementations [11]. There are several file carving strategies proposed in existing research, and the major categories are described below. Some of these strategies are used in existing commercial and open source file carvers. Laurenson's evaluation of available carving software shows that achieving both speed and accuracy in an implementation is difficult, and that the difficulty increases if files are stored in more than one fragment [11]. This result is especially important for investigators to understand, as user-modified files, typically the most relevant to a forensic inquiry, have been shown to be the most likely to be fragmented [8]. Taken together, the investigator today is often left with incomplete, inaccurate, or time-intensive analysis, and in the worst case may suffer all three.

1) Structure-Based Carving

Structure-based carving was one of the first file carving strategies proposed and was implemented in tools such as Foremost and Scalpel [23]. During structure-based carving the data volume is searched for known values or data structures associated with specific file types. A common example, also used later in this project, is the header and footer search. Some file types have specific byte sequences which act as the effective beginning and end of the file, the so-called header and footer. A simple carver using this strategy finds the memory cluster containing a header, finds the memory cluster containing an associated footer, and assumes that the clusters between the two (inclusive) belong to the file to be recovered. Note that this assumption is only true if the file was stored as one contiguous memory segment, with no other allocated or empty clusters in between the header and footer; such a simple strategy will fail to recover fragmented files (a minimal code sketch of this simple strategy appears at the end of this subsection).

There are other problems with the simple structure-based approach. It requires knowing the byte sequences, data structures, or headers and footers of each file type to search for ahead of time, and there is already a multitude of these magic numbers. If a magic number is short, for example three bytes long, the byte sequence may be a common occurrence within the data volume, causing many false positives. Furthermore, the application will need to be updated any time a new sequence is put into use by new file types. There is also the possibility that a file type doesn't use such byte sequences, or that a malicious actor could spoof the byte sequences to reduce the risk of detection.

String search algorithms play an important role in structure-based carving strategies. Liao provides an excellent analysis of different string search algorithms [13]. Other research indicates the Boyer-Moore and Aho-Corasick string matching algorithms are potential candidates for parallel string searching [25].
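To make the simple strategy concrete, the following is a hedged C++ sketch of naive header/footer carving written for this discussion; it is not the project's code or Scalpel's, and the function name is hypothetical. The header and footer values match the JPEG pair used later in Listing 1.

#include <algorithm>
#include <cstdint>
#include <vector>

// Naive structure-based carve: find a JPEG header, find the next footer, and
// assume everything in between belongs to the file. As discussed above, this
// fails whenever the file is fragmented.
std::vector<uint8_t> carveFirstJpeg(const std::vector<uint8_t> &image) {
    const std::vector<uint8_t> header = {0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10};
    const std::vector<uint8_t> footer = {0xff, 0xd9};
    auto begin = std::search(image.begin(), image.end(),
                             header.begin(), header.end());
    if (begin == image.end())
        return {};                                   // no header found
    auto end = std::search(begin + header.size(), image.end(),
                           footer.begin(), footer.end());
    if (end == image.end())
        return {};                                   // no matching footer
    return std::vector<uint8_t>(begin, end + footer.size());
}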

2) Content-Based Carving

Content-based carving grew from the solutions developed for content-based file identification. The problems inherent in structure-based file identification, which are very similar to the problems presented for structure-based file carvers, motivated content-based identification, which was introduced by McDaniel and Heydari [16] and later built upon by others [12][10][26]. The general idea of content-based identification is to characterize a file type based on the typical contents of such a file, then compare an unknown file to the known characterizations to determine what it is.

The general strategies developed for file-type identification were applied to file fragment reassembly [3][7]. The byte values of fragments or sectors are collected, typically as normalized frequency distributions, and statistical measures are applied to try to make judgements about the type of data in the fragment (a small sketch of this characterization step follows this subsection). Calhoun and Coles used means, modes, standard deviation, and entropy, in addition to the longest common subsequence of fragments, to determine fragment type [3]. Fitzgerald et al. gather the statistical measures of known files to create feature vectors and train a support vector machine, which is then used to classify unknown fragments with accuracy as high as 99.7% in the best case [7]. McDaniel and Heydari noted there are some file types without strong patterns to characterize and increased the accuracy of identification by combining structure-based identification methods with content-based methods [16]. Other research utilizes Principal Component Analysis (PCA) and unsupervised neural networks to conduct content-based file identification [1].
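As a hedged illustration of the characterization step described above, the following C++ sketch builds a normalized byte-value distribution for a sector and computes its mean. The names and the 512-byte assumption are illustrative, and the project's own versions of these steps appear in Chapter VI.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

struct SectorFeatures {
    std::array<double, 256> distribution;   // normalized byte-value histogram
    double mean;                            // arithmetic mean of the distribution
};

SectorFeatures characterize(const uint8_t *sector, std::size_t len) {
    std::array<int, 256> counts{};          // one bin per possible byte value
    for (std::size_t i = 0; i < len; ++i)
        counts[sector[i]]++;
    int largest = 1;
    for (int c : counts)
        largest = std::max(largest, c);     // normalize by the largest bin
    SectorFeatures f{};
    double sum = 0.0;
    for (int v = 0; v < 256; ++v) {
        f.distribution[v] = static_cast<double>(counts[v]) / largest;
        sum += f.distribution[v];
    }
    f.mean = sum / 256.0;
    return f;
}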

3) Other Carving Strategies

Another strategy involves assembling the fragments and attempting to open them in the appropriate application; if the file opens, it is assumed not to be a corrupted false positive [8]. Other research describes carving as a mapping function between the image data and the file data, allowing the problem to become a function optimization problem [5]. The problem has also been described as an optimal path construction problem [27].

Context-based carving is another proposed approach. In it, the fragments are first put into categories representing the files they would belong to, such as text or executables. Then the fragments of similar type are reassembled based on the probability that a given fragment would follow the preceding fragment [24].

SmartCarver is another solution to file carving. SmartCarver reduces the amount of data to be processed in the carving stage by removing any disk cluster which is known to be allocated and carving only unallocated clusters [21]. This is a slightly different problem from situations where no file system is present; however, it is interesting to see that file carving is needed even in cases where the file system is available.

III. REVIEW OF RELATED WORK

Some research and implementations are available showing that promising results are possible when using GPUs to conduct digital forensics. Scalpel is a structure-based carver which utilizes header and footer values, and it is one of the few file carvers already implemented on GPU hardware using CUDA [15]. Unfortunately, speeding up Scalpel by running it on a GPU does not address the problems inherent in structure-based carving. In Laurenson's performance measurement of Scalpel, it carved many thousands of false positives when confronted with highly fragmented data [11].

Other researchers focused on GPU acceleration of digital forensics keyword searches [4]. When conducting an inquiry it is often helpful to search for keywords related to the matter being investigated. Sequential searches of large data volumes can be very time consuming, so a GPU-accelerated search was proposed. The implementation showed a potential speedup of 100X during experimentation [4]. A related task, finding known contraband images, is also very time consuming when the search image is many gigabytes in length. Researchers proposed using a GPU to take hashes of file data in the volume in order to compare them to file hashes of known contraband images [6].

These examples show promise but focus narrowly on various forms and applications of pattern searching. This project intends to go further by showing how functions used by advanced carving strategies can also be accelerated using a GPU.

IV. CONCEPTUAL MODEL AND DESIGN

This project aims to show how massive parallelization of functions used during the carving process can achieve faster execution times. Functions representing the core functionality of the most effective known file carving methods were chosen for evaluation (e.g., pattern search, histogram, normalization, correlation, mean, and standard deviation). By choosing specific functions, such as the header and footer search function, a clear comparison between a sequential version and a parallel version can be made. This helps identify the areas with the most potential for improvement through parallelization.

This strategy implies certain consequences. For any comparison of sequential and parallel functions to be valid, the functions should be roughly equivalent in design. As an example, it would be unfair to compare a brute-force sequential pattern search with a parallel implementation of the Boyer-Moore pattern search algorithm. The Boyer-Moore algorithm takes advantage of a unique property of pattern searching in order to skip unnecessary evaluations, and it generally gets faster as the pattern being searched for increases in length. A brute-force search, on the other hand, is a very simple design and skips no evaluations. Comparing the two would introduce the differences between algorithm strategies into the results, whereas this project aims to focus on the differences between the types of hardware executing the algorithms. The parallel implementations necessarily differ in key aspects; however, they were designed to be comparable in strategy to their sequential counterparts. It would also be unfair to look for optimization opportunities on one side of the comparison and not the other. Every attempt is made to keep the comparison as valid as possible by coding the functions specifically for this project. Furthermore, compiler optimization is disabled to prevent unseen changes in the code.

There are competing taxonomies used to describe massively parallel software implementations, depending on the source of a given piece of writing. As this project implements the functions in CUDA and C++, the descriptions of implementations use the taxonomy presented in the CUDA specification where necessary [18].

Timing the execution speeds is accomplished in the sequential implementations using the C++ chrono library. For GPU implementations the timing is recorded using event management functions found in the CUDA runtime API. The final runtime results presented later are an average of one hundred executions. As both implementations use the same function to load the search image to be analyzed, this overhead is not counted toward execution time. It should be noted that any memory management required by the GPU implementation is counted. This memory management isn't required in the sequential implementations, as there is no need to allocate or initialize memory and transfer data to and from a discrete compute device. Not counting this overhead against the GPU implementation's execution time would be unfair, as it is impossible to execute the data processing algorithms on the GPU without these utility functions. There is also overhead present in the GPU implementations due to the need to check for CUDA error codes, which is not present in the sequential counterparts; however, this overhead is believed to be negligible.

Verifying correctness and completeness of the implementations is done using test data samples authored for the National Institute of Standards and Technology's (NIST) Computer Forensics Reference Data Sets (CFReDS) project [17].
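A hedged sketch of the two timing methods described above follows; seqFunction and runGpuImplementation are placeholders for the project's actual routines, not its real symbols.

#include <chrono>
#include <cuda_runtime.h>

void seqFunction();           // hypothetical sequential implementation
void runGpuImplementation();  // hypothetical: cudaMalloc, cudaMemcpy, kernel launch

// CPU timing with the C++ chrono library.
double timeCpu() {
    auto t0 = std::chrono::high_resolution_clock::now();
    seqFunction();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();   // seconds
}

// GPU timing with CUDA events. The events bracket the memory management as
// well as the kernel, matching the accounting described above.
float timeGpu() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    runGpuImplementation();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until stop is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}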

V. PROPERTIES OF DATA

The CFReDS data samples being used are from the File Carving data set and were developed to give forensic practitioners a way to systematically test and compare tools across a range of carving scenarios. Published with the samples are the types of files, start and end sector locations, file fragmentation state, and the datatype of the fragmentation gaps. These samples were developed without file system data to demonstrate the difficulty inherent in file recovery where file system structures are not available. Examples of carving situations include sequentially fragmented files, non-sequentially fragmented files, files with missing fragments, files with nested fragments, and files with braided fragments [17]. The search images contain a variety of file types from categories such as images, movies, audio, and documents.

For the experiments presented, the graphics data samples were used, specifically L0_Graphic.dd and L3_Graphic.dd. The samples from the graphics set contain common graphics file types; both samples contain one file each of the formats JPEG, TIF, PNG, PCX, BMP, and GIF [17]. The experiment testing the pattern search function utilizes L0_Graphic.dd and searches for the header and footer of the single JPEG file in the data. The experiment to identify the fragmentation point or end-of-file sector was run on both samples, searching for a JPEG header and then trying to find the point where the associated JPEG file either ends or fragments. In L0_Graphic.dd the JPEG is unfragmented; in L3_Graphic.dd the JPEG file is fragmented and is missing fragments.

VI. EXPERIMENTAL RESULTS

A. Pattern Search

Searching for byte sequences is a heavily studied topic in both sequential and parallel contexts. Finding byte sequences is the basis of structure-based carving and is used in hybrid carvers to increase the true-positive rate, so it was important to implement for this project as well. Two implementations were developed: Seq_search and Gpu_search. Each can be called with a pattern representing the header or footer value of a given file type. For these experiments a common header/footer pair used by Joint Photographic Experts Group (JPEG) images was chosen, the hexadecimal values of which are shown in Listing 1. Both implementations are simple brute-force search algorithms. For timing purposes each was called twice, once for the header and once for the footer.

Header: 0xff 0xd8 0xff 0xe0 0x00 0x10
Footer: 0xff 0xd9

Listing 1: Hexadecimal values of JPEG header and footer

The search functions return offset values representing the byte position in the search image where the pattern begins, if found. If multiple pattern occurrences are found, the offset for each occurrence is returned.

Seq_search is implemented using standard C++ data structures and the std::search function from the <algorithm> header. The standard library search function performs a brute-force search of the data and returns an iterator to the beginning of the pattern if it is found. If the pattern is not found, the function returns an iterator to the end of the data being searched. The pattern to search for is stored in a vector, the data to be searched is stored in a vector, and the offsets of found patterns are stored in a list. The search runs as a simple loop, described in pseudocode in Listing 2.

Data := Data to be searched
Pattern := Pattern to search for
Results_List := List to store results
Offset := Beginning of Data
While ( Offset is not the End of Data ) {
    Offset := Search ( Pattern )
    If ( Offset is not the End of Data )
        Results_List.Store ( Offset )
}

Listing 2: Pseudocode of the sequential search algorithm

Gpu_search is implemented in a similar way. The pattern to search for and the data to be searched are stored in character arrays. The offsets of any patterns found are stored in an integer array equal in length to the data to be searched; if the pattern is found, the integer at the position equal to the offset where it was found is set to one. The search is launched with one thread per byte of search data, organized into blocks of the maximum width allowed, which on the GP104 is 1024 threads. This allows the global index of a thread to act as an index to the byte the thread starts on within the search data. The search is described in Listing 3 and is executed by each thread. The relative speedup of the GPU implementation over the sequential implementation is shown in Table 1.

Data := Data to search
Data_Index := Thread global index
Data_Length := Length of data to be searched
Pattern := Pattern to search for
Pattern_Index := 0
Pattern_Length := Length of pattern to search for
Results := Integer array of zeroes
While ( Data_Index < Data_Length AND Pattern_Index < Pattern_Length ) {
    If ( Data[ Data_Index ] Is Not Equal to Pattern[ Pattern_Index ] ) {
        The pattern is not found
        Break from While loop
    }
    Data_Index := Data_Index + 1
    Pattern_Index := Pattern_Index + 1
}
If ( Pattern_Index Equals Pattern_Length ) {
    The pattern was found
    Results[ Data_Index - Pattern_Length ] := 1
}

Listing 3: Pseudocode of the parallel search implementation

Implementation    Average Runtime (s)    Speedup
Seq_search        -                      1X
Gpu_search        -                      82X

Table 1: Average search runtime using L0_Graphic.dd sample
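For reference, a CUDA rendering of the kernel in Listing 3 might look like the following sketch. It follows the one-thread-per-byte design described above, though the identifiers are illustrative rather than the project's.

// Brute-force search: one thread per candidate starting byte.
__global__ void gpuSearch(const unsigned char *data, long dataLength,
                          const unsigned char *pattern, int patternLength,
                          int *results) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;  // starting byte
    if (i + patternLength > dataLength)
        return;                                // pattern would run past the end
    for (int j = 0; j < patternLength; ++j)
        if (data[i + j] != pattern[j])
            return;                            // mismatch: no match at offset i
    results[i] = 1;                            // mark a match beginning at offset i
}

// Hypothetical launch: one thread per byte of the search image.
//   int threads = 1024;
//   int blocks = (int)((dataLength + threads - 1) / threads);
//   gpuSearch<<<blocks, threads>>>(d_data, dataLength, d_pattern, patternLength, d_results);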

Even with considerable memory management overhead and no optimization effort, the Gpu_search implementation achieved significantly improved execution times. Both implementations located the header and footer of the single JPEG file in the sample data with zero error.

The JPEG header and footer search testing showed other important results. During the first experiment it was discovered that 0xFF 0xD9 is a very common value sequence. For example, in the test sample L0_Graphic.dd the sequence occurs 357 times despite the sample containing only a single JPEG file. Without intelligent carving this would result in large numbers of false positives and likely corruption of any true positives. To overcome this during the experiment, the footer was modified to include 0x00 as a third byte in the sequence. This modification eliminated all false positives from the result sets. It is not known at this time how well this modification would hold up on real-world data.

As these implementations call the search functions once for each header and footer pattern, one additional modification was tested on the GPU implementation. GPU kernels can be launched sequentially in a single context or concurrently in multiple contexts. The CUDA API achieves this through what NVIDIA calls streams, and each kernel launch can include a stream number parameter identifying which stream the kernel should use. A concurrent approach was tested where the header search was launched in one stream, followed by an immediate launch of the footer search in a second stream, so that both kernels were executing at the same time. The concurrent kernel execution time was longer than the sequential kernel execution time. Rennich has shown previously that running concurrent kernels can take longer than running the same kernels serially on earlier NVIDIA architectures [22].
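The concurrent launch just described might look like the following hedged sketch, where gpuSearch stands for the brute-force kernel sketched after Listing 3 and all other names are placeholders.

// Sketch of the two-stream experiment: header and footer searches launched
// into separate streams so both kernels are eligible to run concurrently.
void searchBothConcurrently(const unsigned char *d_data, long dataLength,
                            const unsigned char *d_header, int headerLen,
                            const unsigned char *d_footer, int footerLen,
                            int *d_headerHits, int *d_footerHits, int blocks) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    gpuSearch<<<blocks, 1024, 0, s1>>>(d_data, dataLength, d_header, headerLen, d_headerHits);
    gpuSearch<<<blocks, 1024, 0, s2>>>(d_data, dataLength, d_footer, footerLen, d_footerHits);
    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}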

Further investigation would be needed to find out why similar observations were made on the newer architecture used in this study. One possibility is that the high number of blocks and threads led to contention for GPU resources and reduced performance overall. This is supported by the fact that in this experiment, also using L0_Graphic.dd, each block utilized the maximum thread count of 1024 and each kernel utilized over 63,000 blocks. At compute capability >= 3.0 the maximum number of threads per multiprocessor is 2048, and the GP104 contains twenty multiprocessors, so 100% occupancy is achieved by only forty blocks. When both kernels execute concurrently, over 126,000 blocks are competing for forty spaces, a consequence of which may be increased runtime.

B. Histogram

The calculation of a byte-value histogram for a memory sector or fragment is the first step toward identifying the content of the memory in content-based carving. Histograms are well suited for parallel execution, as counting byte-value occurrences can be performed in any order. Two implementations were made, Seq_histo and Gpu_histo, and experiments were conducted in the same way as for Seq_search and Gpu_search. Both implementations produce a 256-bin histogram, representing the possible base-ten integer values of a byte, for each sector of the search image. In these experiments the sector size was 512 bytes.

The Seq_histo function stores each sector histogram in a contiguous block of memory treated as an array of integers, the size of which is therefore the number of sectors multiplied by 256, multiplied by the implementation's size of an integer in memory. This could certainly be improved. Since the data is an array of characters, using the contents at a given index as an unsigned character, and therefore an integer data type, allows each byte of data to act as an index into a bin of a histogram. The function is implemented as a nested loop described in Listing 4.

Sectors := Number of sectors to process
J := 0
Data := The data to process
Data_Size := The size in bytes of the data to process
Histogram := The histogram integer array
For ( J < Sectors ) {
    D_offset := J * 512
    H_offset := J * 256
    K := 0
    For ( K < 512 ) {
        If ( K + D_offset < Data_Size )
            Histogram[ Data[ K + D_offset ] + H_offset ] += 1
        K += 1
    }
    J += 1
}

Listing 4: Pseudocode of the sequential histogram algorithm

Gpu_histo is implemented in a similar fashion. Again, as with Gpu_search, a single thread is used for each byte of data in the search image, allowing a thread's global index to act as an index into the data array. The threads are again organized into blocks of 1024 threads. As with Seq_histo, there is one histogram per sector of the search image data and all histograms are stored in a contiguous block of memory. The pseudocode is described in Listing 5.

J := Global thread index
Sector := J / 512 // Integer division
H_offset := Sector * 256
Data := Data to be searched
Data_Size := Size of data to be searched
Histogram := Histogram integer array
If ( J < Data_Size )
    atomicAdd ( Histogram[ Data[ J ] + H_offset ], 1 ) // Adds 1 to the bin

Listing 5: Pseudocode of the parallel histogram algorithm

CUDA includes several atomic functions which allow for atomic operations on shared resources. In Gpu_histo the atomic addition function allows all threads to share the histogram memory and add to the bins without creating race conditions. Table 2 shows the relative speedup of the parallel implementation compared to the sequential implementation.

Implementation    Average Runtime (s)    Speedup
Seq_histo         -                      1X
Gpu_histo         -                      11X

Table 2: Average histogram runtime using L0_Graphic.dd sample

The Gpu_histo function achieves significantly less speedup than Gpu_search. The implementation makes this observation predictable. Gpu_histo utilizes one allocation of global memory on the GPU to store all the histogram data. Global memory is the worst performing memory on the GPU and as such has the highest runtime cost. In addition, any time two threads need to add to the same bin there will be memory contention. The worst case can be observed during the experiment using the sample data L0_Graphic.dd. There are sectors where every byte in the sector holds the same value, such as 0xB9. In such cases every thread processing the sector, 512 threads, contends for the resource and the additions become serialized. The other 512 threads in the block of 1024 threads, which work on the next sector of memory, could potentially be forced to wait excessively. The function also suffers the same occupancy problem as Gpu_search, in that 100% occupancy is achieved with just forty blocks while the kernel launches with more than 63,000.

An earlier version of Gpu_histo attempted a different organization in which one kernel launch processed one histogram for one sector of the search data. This organization was completely unacceptable, as the overhead cost of the kernel launches made the runtime excessively long. It serves as an example that resource management can matter more to total runtime than the algorithm filling in the histograms.

C. Normalization

The histogram distributions need to be normalized for some statistical measures generated later in the program. As a non-existent histogram cannot be normalized, the experiment for this function extends the previous experiment by adding the normalization step. Both versions of the histogram are kept in memory for later use. The Seq_normalize function is similar to Seq_histo and is described in Listing 6.

Histogram := Array of 256-bin histograms, one for each sector, created during Seq_histo
Normalized_Histogram := Array of 256-bin histograms, one for each sector
Sectors := Number of sectors to process
J := 0
For ( J < Sectors ) {
    L := 0
    H := J * 256
    K := 0
    For ( K < 256 ) {
        If ( Histogram[ K + H ] > L )
            L := Histogram[ K + H ]
        K := K + 1
    }
    K := 0
    For ( K < 256 ) {
        Normalized_Histogram[ K + H ] := Histogram[ K + H ] / L
        K := K + 1
    }
    J := J + 1
}

Listing 6: Pseudocode of the sequential normalization function

The Gpu_normalize function utilizes one thread block of 256 threads for each sector histogram, and therefore a number of blocks equal to the number of sector histograms to be processed. This allows the block index to act as the sector number identifying the histogram to be processed. The 256 threads load one index each of the histogram to be normalized from global memory into shared memory. This is necessary in this step in order to perform an in-place parallel reduction. The reduction is used to find the largest value in the histogram and requires modifying the histogram being processed; using shared memory allows this while avoiding modification of global memory. Each thread then loads one index of the normalized histogram with the quotient of the corresponding input histogram value divided by the largest value found in the reduction. The pseudocode for the normalization is shown in Listing 7, and Table 3 shows the speedup comparison of the GPU and sequential implementations. This pseudocode is executed by each kernel block.

Histogram := Array of 256-bin histograms, one for each sector, in global memory
Normalized_Histogram := Array of 256-bin histograms, one for each sector
Shared_Reduce := Array of 256 integers in shared block memory
Shared_Mean := Array of 256 floats in shared block memory
Shared_Largest := A single integer in shared block memory
I := The thread's global index
T := The thread's block index
B := The block index
Stride := 128
Shared_Reduce[ T ] := Histogram[ I ] // Each thread loads a single element from the global array
For ( Stride > 0 ) {
    If ( T < Stride ) {
        If ( Shared_Reduce[ T ] < Shared_Reduce[ T + Stride ] )
            Shared_Reduce[ T ] = Shared_Reduce[ T + Stride ]
    }
    Stride = Stride / 2
}
If ( T == 0 ) // Only a single thread needs to load the shared variable
    Shared_Largest = Shared_Reduce[ T ]
Shared_Reduce[ T ] = Histogram[ I ] // Reloads the shared memory with original values
Shared_Mean[ T ] = Shared_Reduce[ T ] / Shared_Largest // Conversion to float is necessary
Normalized_Histogram[ I ] = Shared_Mean[ T ]

Listing 7: Pseudocode for parallel normalization of a histogram
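A CUDA rendering of Listing 7 might look like the following sketch; it adds the __syncthreads() barriers that the pseudocode leaves implicit, and the identifiers are illustrative rather than the project's.

// One 256-thread block per sector histogram: shared-memory max reduction,
// then each thread writes one normalized bin.
__global__ void gpuNormalize(const int *histograms, float *normalized) {
    __shared__ int reduce[256];
    __shared__ int largest;
    int t = threadIdx.x;                        // index within this sector's histogram
    int i = blockIdx.x * 256 + t;               // index into the global histogram array
    reduce[t] = histograms[i];
    __syncthreads();
    for (int stride = 128; stride > 0; stride /= 2) {
        if (t < stride && reduce[t] < reduce[t + stride])
            reduce[t] = reduce[t + stride];     // keep the larger of the pair
        __syncthreads();
    }
    if (t == 0)
        largest = reduce[0];                    // maximum bin count for this sector
    __syncthreads();
    normalized[i] = (float)histograms[i] / (float)largest;
}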

Implementation    Average Runtime (s)    Speedup
Seq_normalize     -                      1X
Gpu_normalize     -                      13X

Table 3: Average normalization runtime using L0_Graphic.dd sample

Adding this step to the histogram function increases the relative performance gain of the GPU implementation over the sequential implementation from 11X to 13X.

D. Correlation Coefficients, Standard Deviation, Mean

To characterize the content of a sector of data, some measures are needed. The measures chosen were standard deviation, mean, and cosine similarity. The sequential implementation pseudocode of these functions is given in Listing 8 through Listing 11. For both the sequential and parallel versions, the sector mean and standard deviation are stored in a custom data structure for later use, with one such structure required for each sector of data processed.

Array := An array of 256 floats
I := 0
Sum := 0.0 // Used to return the result on function return
For ( I < 256 ) {
    Sum := Sum + Array[ I ]
    I := I + 1
}

Listing 8: Pseudocode of the sequential function to calculate an array summation

Array := An array of 256 floats
Result := 0.0 // Used to return the result on function return
Result := Summation ( Array ) // Summation function described in Listing 8
Result := Result / 256

Listing 9: Pseudocode of the sequential function to calculate the arithmetic mean of an array of floats

Array := An array of 256 floats
Mean := Average ( Array ) // Arithmetic mean function described in Listing 9
Standard_Dev := 0.0 // Used to return the result on function return
I := 0
For ( I < 256 ) {
    Standard_Dev := Standard_Dev + ( ( Array[ I ] - Mean ) * ( Array[ I ] - Mean ) )
    I := I + 1
}
Standard_Dev := Square_Root ( Standard_Dev / 256 )

Listing 10: Pseudocode of the sequential function to calculate an array's standard deviation

A := An array of 256 floats
B := An array of 256 floats
Mul := 0.0
D_A := 0.0
D_B := 0.0
Cosine_Sim := 0.0 // Used to return the result on function return
I := 0
For ( I < 256 ) {
    Mul := Mul + A[ I ] * B[ I ]
    D_A := D_A + A[ I ] * A[ I ]
    D_B := D_B + B[ I ] * B[ I ]
    I := I + 1
}
Cosine_Sim := Mul / ( Square_Root ( D_A ) * Square_Root ( D_B ) )

Listing 11: Pseudocode of the sequential function to calculate the cosine similarity of two arrays

The parallel implementations used to compute the same measures are described in pseudocode in Listing 12 and Listing 13. As the arithmetic mean and standard deviation are calculated using the normalized distributions, they were added as another step to the normalization kernel. Cosine similarity is a measure between two sectors which cannot be precomputed in this way and is calculated in a separate kernel.

Normalized_Histogram := Array of 256-bin histograms, one for each sector
Sector_Structures := Array of structures, one for each sector
Shared_Reduce := Array of 256 integers in shared block memory
Shared_Mean := Array of 256 floats in shared block memory
Shared_Average := A single float in shared block memory
I := The thread's global index
T := The thread's block index
B := The block index
Stride := 128
Shared_Mean[ T ] = Normalized_Histogram[ I ] // Load values into shared memory
For ( Stride > 0 ) {
    If ( T < Stride )
        Shared_Mean[ T ] := Shared_Mean[ T ] + Shared_Mean[ T + Stride ]
    Stride = Stride / 2
}
If ( T == 0 ) { // Only one thread needs to load the variable
    Shared_Average = Shared_Mean[ T ] / 256
    Sector_Structures[ B ].Average = Shared_Average
}
Shared_Mean[ T ] = Normalized_Histogram[ I ] // Reload values into shared memory
Shared_Mean[ T ] = ( Shared_Mean[ T ] - Shared_Average ) * ( Shared_Mean[ T ] - Shared_Average )
Stride := 128
For ( Stride > 0 ) {
    If ( T < Stride )
        Shared_Mean[ T ] = Shared_Mean[ T ] + Shared_Mean[ T + Stride ]
    Stride = Stride / 2
}
If ( T == 0 ) // Only one thread needs to load the variable
    Sector_Structures[ B ].Standard_Deviation = Square_Root ( Shared_Mean[ T ] / 256 )

Listing 12: Pseudocode for parallel calculation of sector standard deviation and mean

Cosine_Similarity := A float in global memory to store the result
A := Array of 256 floats in global memory
B := Array of 256 floats in global memory
Shared_Mul := Array of 256 floats
Shared_A := Array of 256 floats
Shared_B := Array of 256 floats
Shared_DA := Array of 256 floats
Shared_DB := Array of 256 floats
I := The thread's global index
T := The thread's block index
Stride := 128
Shared_A[ T ] := A[ T ]
Shared_B[ T ] := B[ T ]
Shared_Mul[ T ] := Shared_A[ T ] * Shared_B[ T ]
Shared_DA[ T ] := Shared_A[ T ] * Shared_A[ T ]
Shared_DB[ T ] := Shared_B[ T ] * Shared_B[ T ]
For ( Stride > 0 ) {
    If ( T < Stride ) {
        Shared_Mul[ T ] = Shared_Mul[ T ] + Shared_Mul[ T + Stride ]
        Shared_DA[ T ] = Shared_DA[ T ] + Shared_DA[ T + Stride ]
        Shared_DB[ T ] = Shared_DB[ T ] + Shared_DB[ T + Stride ]
    }
    Stride = Stride / 2
}
If ( T == 0 ) { // Only one thread needs to load the result variable
    Cosine_Similarity = Shared_Mul[ T ] / ( Square_Root ( Shared_DA[ T ] ) * Square_Root ( Shared_DB[ T ] ) )
}

Listing 13: Pseudocode for parallel calculation of cosine similarity between two arrays

To test these functions a new experiment was devised. In this experiment the timing starts after the search image is loaded. The search function is used to find the occurrence of any headers, storing the offset of the first byte of each header found. The search image then has all sector histograms and normalized sector histograms generated and stored to memory. Then, for each sector, the arithmetic mean and standard deviation of the sector's normalized distribution are calculated. These values are stored to the custom structure datatype, where one structure holds the values for a single sector. Then, starting with the sector where a header was found, the steps described in Listing 14 are performed. For simplicity the same control loop described in Listing 14 was used for both the sequential and the GPU implementation, calling the appropriate version of the cosine similarity function where indicated. Even though some of the measures calculated for each sector are not used later in the process, they were still calculated so that the implementations could be compared given a large set of tasks to accomplish.

Results := A list to store offsets
While ( There are more header locations in the list ) {
    Sector := The normalized distribution of the values at that sector location
    Next_Sector := The next sector's normalized distribution of values
    Frag := 0
    While ( Not Frag ) {
        Judgement := Cosine_Similarity ( Sector, Next_Sector )
        If ( Judgement < 0.25 )
            Frag := 1
        Else
            Go to the next sector
    }
    Results := The index of the sector that failed the judgement test
}

Listing 14: Pseudocode for fragmentation detection

This process starts at the sector where the header was found and compares it to the next sector using cosine similarity. When the cosine similarity of two consecutive sectors drops below a certain threshold, the index is saved. The sector at that index likely represents one of two possible situations: the end of the file or a file fragmentation point.
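For reference, a CUDA rendering of the cosine similarity kernel in Listing 13 might look like this sketch, again with explicit barriers added and illustrative names.

// One 256-thread block compares two normalized sector distributions.
__global__ void gpuCosineSimilarity(const float *a, const float *b,
                                    float *result) {
    __shared__ float mul[256], da[256], db[256];
    int t = threadIdx.x;
    mul[t] = a[t] * b[t];      // elementwise product term
    da[t]  = a[t] * a[t];      // squared magnitude term of a
    db[t]  = b[t] * b[t];      // squared magnitude term of b
    __syncthreads();
    for (int stride = 128; stride > 0; stride /= 2) {
        if (t < stride) {
            mul[t] += mul[t + stride];
            da[t]  += da[t + stride];
            db[t]  += db[t + stride];
        }
        __syncthreads();
    }
    if (t == 0)
        *result = mul[0] / (sqrtf(da[0]) * sqrtf(db[0]));
}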

Even this simple use of the cosine similarity measure correctly identified the fragmentation point of the JPEG file at a threshold of 0.25 in tests on L0_Graphic.dd and L3_Graphic.dd. The same experiment was conducted using another correlation measure, Pearson's Correlation Coefficient (PCC), instead of cosine similarity. PCC as the judgement value was shown to be less effective, as it did not correctly identify the sector index of JPEG fragmentation or file end. As PCC was not to be used, it will not be described here. Table 4 shows the relative speedup of the GPU implementation over the sequential implementation.

Implementation    Average Runtime (s)    Speedup
Seq_imp           -                      1X
Gpu_imp           -                      12X

Table 4: Average implementation runtime using L0_Graphic.dd sample

Performance analysis of this section shows interesting results. Despite the GPU implementation achieving only a 12X speedup over the sequential implementation, less than 2% of its runtime is spent on kernel computation. The cosine similarity kernel was executed 119 times during this experiment and accounted for only 0.04% of the total runtime. Nearly all of the observed GPU overhead was seen in memory allocation and copying, at 27.86% of the total runtime. This indicates that fragmentation detection could achieve a much higher speedup if fully implemented on the GPU, avoiding the costly CPU management of the cosine similarity algorithm described in Listing 14.

VII. COMPARISON WITH RELATED WORK

Consistent with previous research, this project successfully demonstrated digital forensics functions executing on a GPU and shows that GPU acceleration is a promising way to address the challenges in the field [15][4]. This project goes a step further, as previous researchers focused on the various forensics applications of pattern searching using a GPU [15][4]. Here, modern file carving strategies were evaluated and functions common to many of the strategies were chosen. By implementing these functions twice, sequentially and as GPU kernels, comparisons were made showing that acceleration of a given function is possible.

VIII. CONCLUSION

This project shows that functions used in digital forensics file carving can achieve improved runtimes by executing on massively parallel hardware, and it further shows some of the difficulties encountered when doing so. The functions include pattern search, histogram, normalization, and statistics generation, all of which benefited from executing on a GPU. During experimentation, speedups from 11X to 82X were observed in GPU implementations compared to their sequential CPU counterparts. Problems were encountered and identified which, if addressed, could lead to even greater performance gains.

The translation of a sequential algorithm to a parallel algorithm must take into consideration the unique characteristics of the hardware being used. The algorithms developed for this project used similar strategies for both sequential and parallel functions so that comparisons could be made. In the GPU implementations this led to massive resource contention and processor occupancy problems. In general, one aim of an optimal GPU kernel is to achieve high processor occupancy, but not at the cost of having many thousands of blocks waiting for hardware space to execute on.

The functions implemented in this project, such as Gpu_search and Gpu_histo, show low arithmetic intensity. This can be attributed to the design in which each byte of the search image data is processed by a unique thread. For these functions the clear majority of threads complete a single critical operation; in Gpu_histo, for example, each thread conducts a single add operation. Considering the amount of overhead in creating and managing each thread, this design is extremely wasteful. Using two-dimensional kernels and more advanced designs such as tiling, so that each thread could process multiple bytes of data, would lead to higher arithmetic intensity and better performance.
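As a hedged sketch of that suggestion, a grid-stride loop lets each thread process many bytes with a small, occupancy-friendly grid. This is illustrative, not the project's code.

// Grid-stride histogram: a fixed-size grid sweeps the whole image, so each
// thread performs many additions instead of one.
__global__ void gpuHistoStrided(const unsigned char *data, long dataLength,
                                int *histograms) {
    long stride = (long)gridDim.x * blockDim.x;     // total threads in the grid
    for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
         i < dataLength; i += stride) {
        long sector = i / 512;                      // which sector this byte is in
        atomicAdd(&histograms[sector * 256 + data[i]], 1);
    }
}

// Hypothetical launch with a fixed grid, e.g. gpuHistoStrided<<<40, 1024>>>(...),
// matching the forty blocks needed for full occupancy on the GP104.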

IX. FUTURE WORK

This project shows the potential benefits of parallel computing in digital forensics. As the Heterogeneous System Architecture (HSA) becomes more popular and APIs such as OpenCL improve, it will become even easier to develop digital forensics applications that utilize not only discrete GPUs but also multicore processors and processors integrated with GPU hardware, all at the same time. In such an implementation it is easy to imagine the benefits of having a core application running parallel management tasks. The core process could manage the various kernel launches and other data processing needed while the kernels process enormous amounts of data asynchronously. A similar structure is already present behind the Sleuthkit interface. Sleuthkit is a popular open source digital forensics framework in which a main process communicates with other processes through a message passing interface (MPI), itself already a common strategy in parallel programming. The main process conducts many database operations as the image is processed, and there is already research showing that GPU acceleration of database operations is possible [9]. There are certainly many ways digital forensics applications such as Sleuthkit can be further improved using parallel processing.

As for the individual functions presented in this project, each could be greatly improved through better algorithm design and by following CUDA programming best practices. For example, privatization is one technique which could be used to reduce the resource contention present in Gpu_histo. Using privatization, each block of threads is allocated a private block of shared memory to operate on; once computation is complete, the results are merged back into the global memory allocation (a sketch follows this section). Furthermore, using more advanced algorithms, such as the Boyer-Moore pattern search described earlier, in place of brute-force strategies would likely reduce GPU overhead by avoiding unnecessary work.
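A hedged sketch of privatization applied to the histogram follows. It assumes one 512-thread block per 512-byte sector, and the names are illustrative rather than the project's.

// Privatized histogram: each block accumulates into a private shared-memory
// histogram, then merges its partial counts into global memory, so atomic
// contention stays within the block.
__global__ void gpuHistoPrivate(const unsigned char *data, long dataLength,
                                int *histograms) {
    __shared__ int localBins[256];                 // this block's private histogram
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        localBins[k] = 0;                          // clear the private bins
    __syncthreads();
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < dataLength)
        atomicAdd(&localBins[data[i]], 1);         // block-local atomic add
    __syncthreads();
    long sector = ((long)blockIdx.x * blockDim.x) / 512;  // one block per sector
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        atomicAdd(&histograms[sector * 256 + k], localBins[k]);  // merge to global
}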

APPENDIX A: Hardware Configuration

CPU: AMD A K
  Core Frequency (GHz):
  Cores: 4
  Memory Controller Clock (MHz): 2133
  Memory Channels: 2
  L2 Cache (# x MB): 2 x 2
  Integrated GPU: AMD Radeon R7

GPU: MSI GeForce GTX 1080 SEA HAWK X
  Memory (MB): 8192
  Memory Interface (b): 256
  Memory Speed (Gbps): 10
  Memory Bandwidth (GB/sec): 320
  Memory Clock (MHz):
  Core Frequency (MHz): 1607/1683/1708
  CUDA Cores: 2560

System Memory: G.SKILL TridentX
  Memory (# x GB): 2 x 8
  Memory Clock (MHz): 2133
  DRAM Interface: DDR3

Mainboard: ASRock FM2A88X-ITX+ (FM2+)

REFERENCES

1. M. C. Amirani, M. Toorani, and A. Beheshti. (2008). A new approach to content-based file type detection. Presented at the 2008 IEEE Symposium on Computers and Communications. [Online].

2. N. Beebe. (2009, January). Digital forensic research: The good, the bad and the unaddressed. Presented at the IFIP International Conference on Digital Forensics. [Online].

3. W. Calhoun and D. Coles. (2008, August). Predicting the Types of File Fragments. Presented at the Digital Forensic Research Conference. [Online].

4. C. H. Chen and F. Wu. (2012, January). An Efficient Acceleration of Digital Forensics Search Using GPGPU. Presented at the International Conference on Security and Management (SAM). [Online].

5. M. Cohen. (2007, December). Advanced Carving Techniques. Digital Investigation. [Online]. 4 (3).

6. S. Collange, Y. S. Dandass, M. Daumas, and D. Defour. (2009, January). Using graphics processors for parallelizing hash-based data carving. Presented at the 42nd Hawaii International Conference on System Sciences. [Online].


More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Predicting the Types of File Fragments

Predicting the Types of File Fragments Predicting the Types of File Fragments William C. Calhoun and Drue Coles Department of Mathematics, Computer Science and Statistics Bloomsburg, University of Pennsylvania Bloomsburg, PA 17815 Thanks to

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. File-System Structure File structure Logical storage unit Collection of related information File

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

Administrivia. Minute Essay From 4/11

Administrivia. Minute Essay From 4/11 Administrivia All homeworks graded. If you missed one, I m willing to accept it for partial credit (provided of course that you haven t looked at a sample solution!) through next Wednesday. I will grade

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs Where is the Data? Why you Cannot Debate CPU vs. GPU Performance Without the Answer Chris Gregg and Kim Hazelwood University of Virginia Computer Engineering g Labs 1 GPUs and Data Transfer GPU computing

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9 General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment

Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment Protect Your Software Investment Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment The Difficulty with Embedded Software Development Developing embedded software is complicated.

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Model-Driven Engineering in Digital Forensics. Jeroen van den Bos with Tijs van der Storm and Leon Aronson

Model-Driven Engineering in Digital Forensics. Jeroen van den Bos with Tijs van der Storm and Leon Aronson Model-Driven Engineering in Digital Forensics Jeroen van den Bos (jeroen@infuse.org) with Tijs van der Storm and Leon Aronson Contents Digital forensics MDE in forensics Domain-specific optimizations Conclusion

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio Project Proposal ECE 526 Spring 2006 Modified Data Structure of Aho-Corasick Benfano Soewito, Ed Flanigan and John Pangrazio 1. Introduction The internet becomes the most important tool in this decade

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

Kernel level AES Acceleration using GPUs

Kernel level AES Acceleration using GPUs Kernel level AES Acceleration using GPUs TABLE OF CONTENTS 1 PROBLEM DEFINITION 1 2 MOTIVATIONS.................................................1 3 OBJECTIVE.....................................................2

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Verification and Validation of X-Sim: A Trace-Based Simulator

Verification and Validation of X-Sim: A Trace-Based Simulator http://www.cse.wustl.edu/~jain/cse567-06/ftp/xsim/index.html 1 of 11 Verification and Validation of X-Sim: A Trace-Based Simulator Saurabh Gayen, sg3@wustl.edu Abstract X-Sim is a trace-based simulator

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

Speed Up Your Codes Using GPU

Speed Up Your Codes Using GPU Speed Up Your Codes Using GPU Wu Di and Yeo Khoon Seng (Department of Mechanical Engineering) The use of Graphics Processing Units (GPU) for rendering is well known, but their power for general parallel

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018 Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Advanced Topics in Computer Architecture

Advanced Topics in Computer Architecture Advanced Topics in Computer Architecture Lecture 7 Data Level Parallelism: Vector Processors Marenglen Biba Department of Computer Science University of New York Tirana Cray I m certainly not inventing

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Chapter 11: Implementing File

Chapter 11: Implementing File Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

COMPUTER SCIENCE 4500 OPERATING SYSTEMS

COMPUTER SCIENCE 4500 OPERATING SYSTEMS Last update: 3/28/2017 COMPUTER SCIENCE 4500 OPERATING SYSTEMS 2017 Stanley Wileman Module 9: Memory Management Part 1 In This Module 2! Memory management functions! Types of memory and typical uses! Simple

More information

Accelerating String Matching Using Multi-threaded Algorithm

Accelerating String Matching Using Multi-threaded Algorithm Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Running SNAP. The SNAP Team February 2012

Running SNAP. The SNAP Team February 2012 Running SNAP The SNAP Team February 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi)

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi) A. Summary - In the area of Evaluation and Exploration of Next Generation Systems for Applicability and Performance, over the period of 01/01/11 through 03/31/11 the NCSA Innovative Systems Lab team investigated

More information

Power Estimation of UVA CS754 CMP Architecture

Power Estimation of UVA CS754 CMP Architecture Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As

More information