ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU

A Project Presented to the faculty of the Department of Computer Science, California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science

by Kristofer Carlos Robles

SPRING 2017

© 2017 Kristofer Carlos Robles
ALL RIGHTS RESERVED

ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU

A Project by Kristofer Carlos Robles

Approved by:

__________________________________, Committee Chair
Dr. Pinar Muyan-Ozcelik

__________________________________
Date

Student: Kristofer Carlos Robles

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________________, Graduate Coordinator
Dr. Jinsong Ouyang

__________________________________
Date

Department of Computer Science

Abstract of ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU by Kristofer Carlos Robles

When file system metadata is corrupted, missing, or otherwise unreliable, file carving is the strategy used to recover files from a data volume. Difficulties arise when files within the data volume are stored in more than one fragment. These difficulties are compounded by the very large data volumes which are common today. This project shows how the methods used to address these difficulties can be greatly accelerated using parallel algorithms executed on a Graphics Processing Unit (GPU), which has a massively parallel architecture. The work of previous researchers on the same problem, largely conducted in the context of sequential algorithms and hardware, is used to guide the parallel implementation. Functions for pattern search, histogram, normalization, and statistical calculations are evaluated. With little to no optimization effort, speedups of 11X to 82X are achieved by parallel implementations over sequential implementations of these functions, which are used during file carving operations. Data sets containing fragmented files, authored specifically for evaluating file carving tools, are used to evaluate and compare implementations.

__________________________________, Committee Chair
Dr. Pinar Muyan-Ozcelik

__________________________________
Date

TABLE OF CONTENTS

List of Tables

Chapter
1. INTRODUCTION
2. BACKGROUND
   A. GPU Computing
   B. Digital Forensics
   C. File Carving
      1) Structure-Based Carving
      2) Content-Based Carving
      3) Other Carving Strategies
3. REVIEW OF RELATED WORK
4. CONCEPTUAL MODEL AND DESIGN
5. PROPERTIES OF DATA
6. EXPERIMENTAL RESULTS
   A. Pattern Search
   B. Histogram
   C. Normalization
   D. Correlation Coefficients, Standard Deviation, Mean
7. COMPARISON WITH RELATED WORK
8. CONCLUSION
9. FUTURE WORK
Appendix A. Hardware Configuration
References

LIST OF TABLES

1. Average search runtime using L0_Graphic.dd sample
2. Average histogram runtime using L0_Graphic.dd sample
3. Average normalization runtime using L0_Graphic.dd sample
4. Average implementation runtime using L0_Graphic.dd sample

I. INTRODUCTION

Digital forensic investigation has grown in importance as the global population continues to become more technology dependent and interconnected. These days it is difficult for any type of crime to occur without leaving at least some trace of digital evidence such as geolocation, communications, images, videos, and more. This information can be used to confirm or deny alibis and associations, or in some cases to directly show culpability. In the world of computer crimes, sometimes the digital evidence is the only evidence available to investigators. Many of the same digital forensic tools and methods developed for use in criminal investigations are seeing daily use in data recovery, computer intrusion investigations, and computer attack attribution.

The forensic examination of digital data volumes requires understanding what the volume contains, for example file names, file types, file history metadata, and the files themselves. When the file system metadata, such as the File Allocation Table (FAT) in a FAT32-formatted volume, is corrupted, missing, or otherwise unreliable, the process of retrieving this information is referred to as file carving. File carving is problematic, and it becomes very difficult when files are stored in multiple fragments within the data volume. While some implementations have been successful at achieving accurate file recovery or fast execution speed, it is difficult to achieve both simultaneously.

This project aims to utilize a GPU, an often-underused resource of computational power in non-graphics applications, to address these problems. Consumer-grade GPU hardware uses massively parallel architectures and is relatively cheap compared to other options of similar computational power. This project splits the computational load of file carving between a CPU and a GPU to show that the functions used in file carvers can be greatly accelerated when executed as parallel algorithms on a GPU. The specific hardware configuration is listed in Appendix A. The hardware was chosen to demonstrate what can be expected from a low-power and low-cost implementation, as many law enforcement agencies and information security teams may be working within tight budget constraints.

II. BACKGROUND

A. GPU Computing

This project shows how the use of parallel algorithms developed to run on a consumer-grade GPU can greatly accelerate the functions used in file carving operations. Although it is common for games and graphics programs to make great use of GPU hardware, it is an often-underutilized source of additional compute power for non-graphics applications. GPU technology continues to push forward as higher display resolutions and virtual reality become mainstream in the marketplace. Common hardware such as the NVIDIA GTX 1080 used in this project contains 2560 compute cores and a large amount of dedicated memory. Manufacturers such as NVIDIA and organizations such as the Heterogeneous System Architecture (HSA) Foundation have also been improving the software interfaces to GPU hardware, making utilization easier than ever for application programmers. This project uses NVIDIA's Compute Unified Device Architecture (CUDA) platform [18], which will feel familiar to programmers who have used C++.

GPUs were designed for highly parallel graphical processing tasks, and ever since NVIDIA released the world's first GPU in 1999 they have always included large numbers of relatively low-power processing units working in parallel [19]. Researchers noticed these resources and began using them to solve non-graphical problems even before the hardware producers provided a convenient application programming interface (API), which may have driven GPU makers to make more of the hardware components of GPUs programmable over time. This early work likely helped to drive interest in and development of new applications of GPU hardware and improved programming interfaces.
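For illustration, a minimal CUDA example follows. It is a sketch written for this overview rather than code from the project, and every name in it is arbitrary: a kernel is ordinary C++ with a __global__ qualifier, and the <<<blocks, threads>>> syntax launches it across thousands of threads at once.

#include <cuda_runtime.h>

// Minimal CUDA example (illustrative only): each thread doubles one element.
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));            // allocate device memory
    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover n
    doubleElements<<<blocks, threads>>>(d_data, n);  // launch n parallel threads
    cudaDeviceSynchronize();                         // wait for the kernel
    cudaFree(d_data);
    return 0;
}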

The market demand for specialized GPU computing hardware is high enough today that NVIDIA has taken a modular approach, building some GPU configurations specifically for compute-intensive applications and others for graphics applications using the same underlying architecture components [20].

B. Digital Forensics

Given the challenges faced by digital forensics investigators today, it is natural to ask if this underutilized and economical resource could be applied. The background material on digital forensics history, process, and the state of the art has been covered extensively and will not be reintroduced here [2][14]. This project focuses on the second stage of a forensic inquiry, analysis, which can be one of the more computationally intense stages.

During the first stage of a typical forensic examination, a copy of the digital data volume being examined is made so it can be analyzed without risk of modifying the original. In most literature this copy of data is referred to as the image or search image, as it is often collected using the same techniques used to image a hard disk drive (HDD). Once the copy is acquired and validated against the original volume using a cryptographic hash, the second stage, analysis, begins.

The first task during analysis is to understand what the image contains. In the context of a consumer PC, analysis aims to retrieve file names, types, history metadata, and the files themselves in a machine-readable format. In the best-case scenario the file system metadata of the image is available and these aims can be achieved with accuracy and confidence. When the file system metadata is corrupted, missing, or suspected to be unreliable due to tampering, achieving this understanding of the image contents becomes much more complicated. In such a scenario file carving is used to analyze the image.

C. File Carving

File carving attempts to recover the files from a search image without the use of file system metadata. Under these conditions the best chance of successfully recovering a given file exists when the file is stored in one contiguous segment within the image. When a file is stored as two or more segments within the image, a condition known as fragmentation, successful recovery becomes much more difficult. Many file carving implementations struggle to recover fragmented files accurately or do not scale well enough to be practical given the enormous data volumes common today.

Of primary concern in both traditional criminal investigations and computer emergency response are the accuracy of a given forensic method and its speed of execution. Inaccurate or untimely evidence has little to no value in these pursuits. Previous research has shown that these two attributes are deeply intertwined in file carving implementations [11]. There are several file carving strategies proposed in existing research, and the major categories are described below. Some of these strategies are used in existing commercial and open source file carvers. Laurenson's evaluation of available carving software shows that achieving both speed and accuracy in an implementation is difficult, and that the difficulty increases if files are stored in more than one fragment [11]. This result is especially important for investigators to understand, as user-modified files, typically the most relevant to a forensic inquiry, have been shown to be the most likely to be fragmented [8]. Taken together, the investigator today is often left with incomplete, inaccurate, or time-intensive analysis, and in the worst case may suffer all three.

1) Structure-Based Carving

Structure-based carving was one of the first file carving strategies proposed and was implemented in tools such as Foremost and Scalpel [23]. During structure-based carving the data volume is searched for known values or data structures associated with specific file types. A common example, also used later in this project, is the header and footer search. Some file types have specific byte sequences which act as the effective beginning and end of the file, the so-called header and footer. A simple carver using this strategy finds the memory cluster containing a header, finds the memory cluster containing an associated footer, and assumes that the clusters between the two (inclusive) belong to the file to be recovered. Note that this assumption is only true if the file was stored as one contiguous memory segment, with no other allocated or empty clusters in between the header and footer; such a simple strategy will fail to recover fragmented files (a minimal code sketch of this simple strategy appears at the end of this subsection).

There are other problems with the simple structure-based approach. It requires knowing the byte sequences, data structures, or headers and footers of each file type to search for ahead of time, and there is already a multitude of these magic numbers. If a magic number is short, for example three bytes long, the byte sequence may be a common occurrence within the data volume, causing many false positives. Furthermore, the application will need to be updated any time a new sequence is put into use by new file types. There is also the possibility that a file type doesn't use such byte sequences, or that a malicious actor could spoof the byte sequences to reduce the risk of detection.

String search algorithms play an important role in structure-based carving strategies. Liao provides an excellent analysis of different string search algorithms [13]. Other research indicates the Boyer-Moore and Aho-Corasick string matching algorithms are potential candidates for parallel string searching [25].
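To make the simple strategy concrete, the following is a hedged C++ sketch of naive header/footer carving written for this discussion; it is not the project's code or Scalpel's, and the function name is hypothetical. The header and footer values match the JPEG pair used later in Listing 1.

#include <algorithm>
#include <cstdint>
#include <vector>

// Naive structure-based carve: find a JPEG header, find the next footer, and
// assume everything in between belongs to the file. As discussed above, this
// fails whenever the file is fragmented.
std::vector<uint8_t> carveFirstJpeg(const std::vector<uint8_t> &image) {
    const std::vector<uint8_t> header = {0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10};
    const std::vector<uint8_t> footer = {0xff, 0xd9};
    auto begin = std::search(image.begin(), image.end(),
                             header.begin(), header.end());
    if (begin == image.end())
        return {};                                   // no header found
    auto end = std::search(begin + header.size(), image.end(),
                           footer.begin(), footer.end());
    if (end == image.end())
        return {};                                   // no matching footer
    return std::vector<uint8_t>(begin, end + footer.size());
}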

2) Content-Based Carving

Content-based carving grew from the solutions developed for content-based file identification. The problems inherent in structure-based file identification, which are very similar to the problems presented for structure-based file carvers, motivated content-based identification, which was introduced by McDaniel and Heydari [16] and later built upon by others [12][10][26]. The general idea of content-based identification is to characterize a file type based on the typical contents of such a file, then compare an unknown file to the known characterizations to determine what it is.

The general strategies developed for file-type identification were applied to file fragment reassembly [3][7]. The byte values of fragments or sectors are collected, typically as normalized frequency distributions, and statistical measures are applied to try to make judgements about the type of data in the fragment (a small sketch of this characterization step follows this subsection). Calhoun and Coles used means, modes, standard deviation, and entropy, in addition to the longest common subsequence of fragments, to determine fragment type [3]. Fitzgerald et al. gather the statistical measures of known files to create feature vectors and train a support vector machine, which is then used to classify unknown fragments with accuracy as high as 99.7% in the best case [7]. McDaniel and Heydari noted there are some file types without strong patterns to characterize and increased the accuracy of identification by combining structure-based identification methods with content-based methods [16]. Other research utilizes Principal Component Analysis (PCA) and unsupervised neural networks to conduct content-based file identification [1].
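As a hedged illustration of the characterization step described above, the following C++ sketch builds a normalized byte-value distribution for a sector and computes its mean. The names and the 512-byte assumption are illustrative, and the project's own versions of these steps appear in Chapter VI.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

struct SectorFeatures {
    std::array<double, 256> distribution;   // normalized byte-value histogram
    double mean;                            // arithmetic mean of the distribution
};

SectorFeatures characterize(const uint8_t *sector, std::size_t len) {
    std::array<int, 256> counts{};          // one bin per possible byte value
    for (std::size_t i = 0; i < len; ++i)
        counts[sector[i]]++;
    int largest = 1;
    for (int c : counts)
        largest = std::max(largest, c);     // normalize by the largest bin
    SectorFeatures f{};
    double sum = 0.0;
    for (int v = 0; v < 256; ++v) {
        f.distribution[v] = static_cast<double>(counts[v]) / largest;
        sum += f.distribution[v];
    }
    f.mean = sum / 256.0;
    return f;
}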

3) Other Carving Strategies

Another strategy involves assembling the fragments and attempting to open them in the appropriate application; if the file opens, it is assumed not to be a corrupted false positive [8]. Other research describes carving as a mapping function between the image data and the file data, allowing the problem to become a function optimization problem [5]. The problem has also been described as an optimal path construction problem [27].

Context-based carving is another proposed approach. In it, the fragments are first put into categories representing the files they would belong to, such as text or executables. Then the fragments of similar type are reassembled based on the probability that a given fragment would follow the preceding fragment [24].

SmartCarver is another solution to file carving. SmartCarver reduces the amount of data to be processed in the carving stage by removing any disk cluster which is known to be allocated and carving only unallocated clusters [21]. This is a slightly different problem from situations where no file system is present; however, it is interesting to see that file carving is needed even in cases where the file system is available.

III. REVIEW OF RELATED WORK

Some research and implementations are available showing that promising results are possible when using GPUs to conduct digital forensics. Scalpel is a structure-based carver which utilizes header and footer values, and it is one of the few file carvers already implemented on GPU hardware using CUDA [15]. Unfortunately, speeding up Scalpel by running it on a GPU does not address the problems inherent in structure-based carving. In Laurenson's performance measurement of Scalpel, it carved many thousands of false positives when confronted with highly fragmented data [11].

Other researchers focused on GPU acceleration of digital forensics keyword searches [4]. When conducting an inquiry it is often helpful to search for keywords related to the matter being investigated. Sequential searches of large data volumes can be very time consuming, so a GPU-accelerated search was proposed. The implementation showed a potential speedup of 100X during experimentation [4]. A related task, finding known contraband images, is also very time consuming when the search image is many gigabytes in length. Researchers proposed using a GPU to take hashes of file data in the volume in order to compare them to file hashes of known contraband images [6].

These examples show promise but focus narrowly on various forms and applications of pattern searching. This project intends to go further by showing how functions used by advanced carving strategies can also be accelerated using a GPU.

IV. CONCEPTUAL MODEL AND DESIGN

This project aims to show how massive parallelization of functions used during the carving process can achieve faster execution times. Functions representing the core functionality of the most effective known file carving methods were chosen for evaluation (e.g., pattern search, histogram, normalization, correlation, mean, and standard deviation). By choosing specific functions, such as the header and footer search function, a clear comparison between a sequential version and a parallel version can be made. This helps identify the areas with the most potential for improvement through parallelization.

This strategy implies certain consequences. For any comparison of sequential and parallel functions to be valid, the functions should be roughly equivalent in design. As an example, it would be unfair to compare a brute-force sequential pattern search with a parallel implementation of the Boyer-Moore pattern search algorithm. The Boyer-Moore algorithm takes advantage of a unique property of pattern searching in order to skip unnecessary evaluations, and it generally gets faster as the pattern being searched for increases in length. A brute-force search, on the other hand, is a very simple design and skips no evaluations. Comparing the two would introduce the differences between algorithm strategies into the results, whereas this project aims to focus on the differences between the types of hardware executing the algorithms. The parallel implementations necessarily differ in key aspects; however, they were designed to be comparable in strategy to their sequential counterparts. It would also be unfair to look for optimization opportunities on one side of the comparison and not the other. Every attempt is made to keep the comparison as valid as possible by coding the functions specifically for this project. Furthermore, compiler optimization is disabled to prevent unseen changes in the code.

There are competing taxonomies used to describe massively parallel software implementations, depending on the source of a given piece of writing. As this project implements the functions in CUDA and C++, the descriptions of implementations use the taxonomy presented in the CUDA specification where necessary [18].

Timing the execution speeds is accomplished in the sequential implementations using the C++ chrono library. For GPU implementations the timing is recorded using event management functions found in the CUDA runtime API. The final runtime results presented later are an average of one hundred executions. As both implementations use the same function to load the search image to be analyzed, this overhead is not counted toward execution time. It should be noted that any memory management required by the GPU implementation is counted. This memory management isn't required in the sequential implementations, as there is no need to allocate or initialize memory and transfer data to and from a discrete compute device. Not counting this overhead against the GPU implementation's execution time would be unfair, as it is impossible to execute the data processing algorithms on the GPU without these utility functions. There is also overhead present in the GPU implementations due to the need to check for CUDA error codes, which is not present in the sequential counterparts; however, this overhead is believed to be negligible.

Verifying correctness and completeness of the implementations is done using test data samples authored for the National Institute of Standards and Technology's (NIST) Computer Forensics Reference Data Sets (CFReDS) project [17].
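A hedged sketch of the two timing methods described above follows; seqFunction and runGpuImplementation are placeholders for the project's actual routines, not its real symbols.

#include <chrono>
#include <cuda_runtime.h>

void seqFunction();           // hypothetical sequential implementation
void runGpuImplementation();  // hypothetical: cudaMalloc, cudaMemcpy, kernel launch

// CPU timing with the C++ chrono library.
double timeCpu() {
    auto t0 = std::chrono::high_resolution_clock::now();
    seqFunction();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();   // seconds
}

// GPU timing with CUDA events. The events bracket the memory management as
// well as the kernel, matching the accounting described above.
float timeGpu() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    runGpuImplementation();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until stop is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}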

V. PROPERTIES OF DATA

The CFReDS data samples being used are from the File Carving data set and were developed to give forensic practitioners a way to systematically test and compare tools across a range of carving scenarios. Published with the samples are the types of files, start and end sector locations, file fragmentation state, and the datatype of the fragmentation gaps. These samples were developed without file system data to demonstrate the difficulty inherent in file recovery where file system structures are not available. Examples of carving situations include sequentially fragmented files, non-sequentially fragmented files, files with missing fragments, files with nested fragments, and files with braided fragments [17]. The search images contain a variety of file types from categories such as images, movies, audio, and documents.

For the experiments presented, the graphics data samples were used, specifically L0_Graphic.dd and L3_Graphic.dd. The samples from the graphics set contain common graphics file types; both samples contain one file each of the formats JPEG, TIF, PNG, PCX, BMP, and GIF [17]. The experiment testing the pattern search function utilizes L0_Graphic.dd and searches for the header and footer of the single JPEG file in the data. The experiment to identify the fragmentation point or end-of-file sector was run on both samples, searching for a JPEG header and then trying to find the point where the associated JPEG file either ends or fragments. In L0_Graphic.dd the JPEG is unfragmented; in L3_Graphic.dd the JPEG file is fragmented and is missing fragments.

VI. EXPERIMENTAL RESULTS

A. Pattern Search

Searching for byte sequences is a heavily studied topic in both sequential and parallel contexts. Finding byte sequences is the basis of structure-based carving and is used in hybrid carvers to increase the true-positive rate, so it was important to implement for this project as well. Two implementations were developed: Seq_search and Gpu_search. Each can be called with a pattern representing the header or footer value of a given file type. For these experiments a common header/footer pair used by Joint Photographic Experts Group (JPEG) images was chosen, the hexadecimal values of which are shown in Listing 1. Both implementations are simple brute-force search algorithms. For timing purposes each was called twice, once for the header and once for the footer.

Header: 0xff 0xd8 0xff 0xe0 0x00 0x10
Footer: 0xff 0xd9

Listing 1: Hexadecimal values of JPEG header and footer

The search functions return offset values representing the byte position in the search image where the pattern begins, if found. If multiple pattern occurrences are found, the offset for each occurrence is returned.

Seq_search is implemented using standard C++ data structures and the std::search function from the <algorithm> header. The standard library search function performs a brute-force search of the data and returns an iterator to the beginning of the pattern if it is found. If the pattern is not found, the function returns an iterator to the end of the data being searched. The pattern to search for is stored in a vector, the data to be searched is stored in a vector, and the offsets of found patterns are stored in a list. The search runs as a simple loop, described in pseudocode in Listing 2.

Data := Data to be searched
Pattern := Pattern to search for
Results_List := List to store results
Offset := Beginning of Data
While ( Offset is not the End of Data ) {
    Offset := Search ( Pattern )
    If ( Offset is not the End of Data )
        Results_List.Store ( Offset )
}

Listing 2: Pseudocode of the sequential search algorithm

Gpu_search is implemented in a similar way. The pattern to search for and the data to be searched are stored in character arrays. The offsets of any patterns found are stored in an integer array equal in length to the data to be searched; if the pattern is found, the integer at the position equal to the offset where it was found is set to one. The search is launched with one thread per byte of search data, organized into blocks of the maximum width allowed, which on the GP104 is 1024 threads. This allows the global index of a thread to act as an index to the byte the thread starts on within the search data. The search is described in Listing 3 and is executed by each thread. The relative speedup of the GPU implementation over the sequential implementation is shown in Table 1.

Data := Data to search
Data_Index := Thread global index
Data_Length := Length of data to be searched
Pattern := Pattern to search for
Pattern_Index := 0
Pattern_Length := Length of pattern to search for
Results := Integer array of zeroes
While ( Data_Index < Data_Length AND Pattern_Index < Pattern_Length ) {
    If ( Data[ Data_Index ] Is Not Equal to Pattern[ Pattern_Index ] ) {
        The pattern is not found
        Break from While loop
    }
    Data_Index := Data_Index + 1
    Pattern_Index := Pattern_Index + 1
}
If ( Pattern_Index Equals Pattern_Length ) {
    The pattern was found
    Results[ Data_Index - Pattern_Length ] := 1
}

Listing 3: Pseudocode of the parallel search implementation

Implementation    Average Runtime (s)    Speedup
Seq_search        -                      1X
Gpu_search        -                      82X

Table 1: Average search runtime using L0_Graphic.dd sample
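For reference, a CUDA rendering of the kernel in Listing 3 might look like the following sketch. It follows the one-thread-per-byte design described above, though the identifiers are illustrative rather than the project's.

// Brute-force search: one thread per candidate starting byte.
__global__ void gpuSearch(const unsigned char *data, long dataLength,
                          const unsigned char *pattern, int patternLength,
                          int *results) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;  // starting byte
    if (i + patternLength > dataLength)
        return;                                // pattern would run past the end
    for (int j = 0; j < patternLength; ++j)
        if (data[i + j] != pattern[j])
            return;                            // mismatch: no match at offset i
    results[i] = 1;                            // mark a match beginning at offset i
}

// Hypothetical launch: one thread per byte of the search image.
//   int threads = 1024;
//   int blocks = (int)((dataLength + threads - 1) / threads);
//   gpuSearch<<<blocks, threads>>>(d_data, dataLength, d_pattern, patternLength, d_results);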

Even with considerable memory management overhead and no optimization effort, the Gpu_search implementation achieved significantly improved execution times. Both implementations located the header and footer of the single JPEG file in the sample data with zero error.

The JPEG header and footer search testing showed other important results. During the first experiment it was discovered that 0xFF 0xD9 is a very common value sequence. For example, in the test sample L0_Graphic.dd the sequence occurs 357 times despite the sample containing only a single JPEG file. Without intelligent carving this would result in large numbers of false positives and likely corruption of any true positives. To overcome this during the experiment, the footer was modified to include 0x00 as a third byte in the sequence. This modification eliminated all false positives from the result sets. It is not known at this time how well this modification would hold up on real-world data.

As these implementations call the search functions once for each header and footer pattern, one additional modification was tested on the GPU implementation. GPU kernels can be launched sequentially in a single context or concurrently in multiple contexts. The CUDA API achieves this through what NVIDIA calls streams, and each kernel launch can include a stream number parameter identifying which stream the kernel should use. A concurrent approach was tested where the header search was launched in one stream, followed by an immediate launch of the footer search in a second stream, so that both kernels were executing at the same time. The concurrent kernel execution time was longer than the sequential kernel execution time. Rennich has shown previously that running concurrent kernels can take longer than running the same kernels serially on earlier NVIDIA architectures [22].
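The concurrent launch just described might look like the following hedged sketch, where gpuSearch stands for the brute-force kernel sketched after Listing 3 and all other names are placeholders.

// Sketch of the two-stream experiment: header and footer searches launched
// into separate streams so both kernels are eligible to run concurrently.
void searchBothConcurrently(const unsigned char *d_data, long dataLength,
                            const unsigned char *d_header, int headerLen,
                            const unsigned char *d_footer, int footerLen,
                            int *d_headerHits, int *d_footerHits, int blocks) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    gpuSearch<<<blocks, 1024, 0, s1>>>(d_data, dataLength, d_header, headerLen, d_headerHits);
    gpuSearch<<<blocks, 1024, 0, s2>>>(d_data, dataLength, d_footer, footerLen, d_footerHits);
    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}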

Further investigation would be needed to find out why similar observations were made on the newer architecture used in this study. One possibility is that the high number of blocks and threads led to contention for GPU resources and reduced performance overall. This is supported by the fact that in this experiment, also using L0_Graphic.dd, each block utilized the maximum thread count of 1024 and each kernel utilized over 63,000 blocks. At compute capability >= 3.0 the maximum number of threads per multiprocessor is 2048, and the GP104 contains twenty multiprocessors, so 100% occupancy is achieved by only forty blocks. When both kernels execute concurrently, over 126,000 blocks are competing for forty spaces, a consequence of which may be increased runtime.

B. Histogram

The calculation of a byte-value histogram for a memory sector or fragment is the first step toward identifying the content of the memory in content-based carving. Histograms are well suited for parallel execution, as counting byte-value occurrences can be performed in any order. Two implementations were made, Seq_histo and Gpu_histo, and experiments were conducted in the same way as for Seq_search and Gpu_search. Both implementations produce a 256-bin histogram, representing the possible base-ten integer values of a byte, for each sector of the search image. In these experiments the sector size was 512 bytes.

The Seq_histo function stores each sector histogram in a contiguous block of memory treated as an array of integers, the size of which is therefore the number of sectors multiplied by 256, multiplied by the implementation's size of an integer in memory. This could certainly be improved. Since the data is an array of characters, using the contents at a given index as an unsigned character, and therefore an integer data type, allows each byte of data to act as an index into a bin of a histogram. The function is implemented as a nested loop described in Listing 4.

Sectors := Number of sectors to process
J := 0
Data := The data to process
Data_Size := The size in bytes of the data to process
Histogram := The histogram integer array
For ( J < Sectors ) {
    D_offset := J * 512
    H_offset := J * 256
    K := 0
    For ( K < 512 ) {
        If ( K + D_offset < Data_Size )
            Histogram[ Data[ K + D_offset ] + H_offset ] += 1
        K += 1
    }
    J += 1
}

Listing 4: Pseudocode of the sequential histogram algorithm

Gpu_histo is implemented in a similar fashion. Again, as with Gpu_search, a single thread is used for each byte of data in the search image, allowing a thread's global index to act as an index into the data array. The threads are again organized into blocks of 1024 threads. As with Seq_histo, there is one histogram per sector of the search image data and all histograms are stored in a contiguous block of memory. The pseudocode is described in Listing 5.

J := Global thread index
Sector := J / 512 // Integer division
H_offset := Sector * 256
Data := Data to be searched
Data_Size := Size of data to be searched
Histogram := Histogram integer array
If ( J < Data_Size )
    atomicAdd ( Histogram[ Data[ J ] + H_offset ], 1 ) // Adds 1 to the bin

Listing 5: Pseudocode of the parallel histogram algorithm

CUDA includes several atomic functions which allow for atomic operations on shared resources. In Gpu_histo the atomic addition function allows all threads to share the histogram memory and add to the bins without creating race conditions. Table 2 shows the relative speedup of the parallel implementation compared to the sequential implementation.

Implementation    Average Runtime (s)    Speedup
Seq_histo         -                      1X
Gpu_histo         -                      11X

Table 2: Average histogram runtime using L0_Graphic.dd sample

The Gpu_histo function achieves significantly less speedup than Gpu_search. The implementation makes this observation predictable. Gpu_histo utilizes one allocation of global memory on the GPU to store all the histogram data. Global memory is the worst performing memory on the GPU and as such has the highest runtime cost. In addition, any time two threads need to add to the same bin there will be memory contention. The worst case can be observed during the experiment using the sample data L0_Graphic.dd. There are sectors where every byte in the sector holds the same value, such as 0xB9. In such cases every thread processing the sector, 512 threads, contends for the resource and the additions become serialized. The other 512 threads in the block of 1024 threads, which work on the next sector of memory, could potentially be forced to wait excessively. The function also suffers the same occupancy problem as Gpu_search, in that 100% occupancy is achieved with just forty blocks while the kernel launches with more than 63,000.

An earlier version of Gpu_histo attempted a different organization in which one kernel launch processed one histogram for one sector of the search data. This organization was completely unacceptable, as the overhead cost of the kernel launches made the runtime excessively long. It serves as an example that resource management can matter more to total runtime than the algorithm filling in the histograms.

C. Normalization

The histogram distributions need to be normalized for some statistical measures generated later in the program. As a non-existent histogram cannot be normalized, the experiment for this function extends the previous experiment by adding the normalization step. Both versions of the histogram are kept in memory for later use. The Seq_normalize function is similar to Seq_histo and is described in Listing 6.

Histogram := Array of 256-bin histograms, one for each sector, created during Seq_histo
Normalized_Histogram := Array of 256-bin histograms, one for each sector
Sectors := Number of sectors to process
J := 0
For ( J < Sectors ) {
    L := 0
    H := J * 256
    K := 0
    For ( K < 256 ) {
        If ( Histogram[ K + H ] > L )
            L := Histogram[ K + H ]
        K := K + 1
    }
    K := 0
    For ( K < 256 ) {
        Normalized_Histogram[ K + H ] := Histogram[ K + H ] / L
        K := K + 1
    }
    J := J + 1
}

Listing 6: Pseudocode of the sequential normalization function

The Gpu_normalize function utilizes one thread block of 256 threads for each sector histogram, and therefore a number of blocks equal to the number of sector histograms to be processed. This allows the block index to act as the sector number identifying the histogram to be processed. The 256 threads load one index each of the histogram to be normalized from global memory into shared memory. This is necessary in this step in order to perform an in-place parallel reduction. The reduction is used to find the largest value in the histogram and requires modifying the histogram being processed; using shared memory allows this while avoiding modification of global memory. Each thread then loads one index of the normalized histogram with the quotient of the corresponding input histogram value divided by the largest value found in the reduction. The pseudocode for the normalization is shown in Listing 7, and Table 3 shows the speedup comparison of the GPU and sequential implementations. This pseudocode is executed by each kernel block.

Histogram := Array of 256-bin histograms, one for each sector, in global memory
Normalized_Histogram := Array of 256-bin histograms, one for each sector
Shared_Reduce := Array of 256 integers in shared block memory
Shared_Mean := Array of 256 floats in shared block memory
Shared_Largest := A single integer in shared block memory
I := The thread's global index
T := The thread's block index
B := The block index
Stride := 128
Shared_Reduce[ T ] := Histogram[ I ] // Each thread loads a single element from the global array
For ( Stride > 0 ) {
    If ( T < Stride ) {
        If ( Shared_Reduce[ T ] < Shared_Reduce[ T + Stride ] )
            Shared_Reduce[ T ] = Shared_Reduce[ T + Stride ]
    }
    Stride = Stride / 2
}
If ( T == 0 ) // Only a single thread needs to load the shared variable
    Shared_Largest = Shared_Reduce[ T ]
Shared_Reduce[ T ] = Histogram[ I ] // Reloads the shared memory with original values
Shared_Mean[ T ] = Shared_Reduce[ T ] / Shared_Largest // Conversion to float is necessary
Normalized_Histogram[ I ] = Shared_Mean[ T ]

Listing 7: Pseudocode for parallel normalization of a histogram
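A CUDA rendering of Listing 7 might look like the following sketch; it adds the __syncthreads() barriers that the pseudocode leaves implicit, and the identifiers are illustrative rather than the project's.

// One 256-thread block per sector histogram: shared-memory max reduction,
// then each thread writes one normalized bin.
__global__ void gpuNormalize(const int *histograms, float *normalized) {
    __shared__ int reduce[256];
    __shared__ int largest;
    int t = threadIdx.x;                        // index within this sector's histogram
    int i = blockIdx.x * 256 + t;               // index into the global histogram array
    reduce[t] = histograms[i];
    __syncthreads();
    for (int stride = 128; stride > 0; stride /= 2) {
        if (t < stride && reduce[t] < reduce[t + stride])
            reduce[t] = reduce[t + stride];     // keep the larger of the pair
        __syncthreads();
    }
    if (t == 0)
        largest = reduce[0];                    // maximum bin count for this sector
    __syncthreads();
    normalized[i] = (float)histograms[i] / (float)largest;
}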

Implementation    Average Runtime (s)    Speedup
Seq_normalize     -                      1X
Gpu_normalize     -                      13X

Table 3: Average normalization runtime using L0_Graphic.dd sample

Adding this step to the histogram function increases the relative performance gain of the GPU implementation over the sequential implementation from 11X to 13X.

D. Correlation Coefficients, Standard Deviation, Mean

To characterize the content of a sector of data, some measures are needed. The measures chosen were standard deviation, mean, and cosine similarity. The sequential implementation pseudocode of these functions is given in Listing 8 through Listing 11. For both the sequential and parallel versions, the sector mean and standard deviation are stored in a custom data structure for later use, with one such structure required for each sector of data processed.

Array := An array of 256 floats
I := 0
Sum := 0.0 // Used to return the result on function return
For ( I < 256 ) {
    Sum := Sum + Array[ I ]
    I := I + 1
}

Listing 8: Pseudocode of the sequential function to calculate an array summation

Array := An array of 256 floats
Result := 0.0 // Used to return the result on function return
Result := Summation ( Array ) // Summation function described in Listing 8
Result := Result / 256

Listing 9: Pseudocode of the sequential function to calculate the arithmetic mean of an array of floats

Array := An array of 256 floats
Mean := Average ( Array ) // Arithmetic mean function described in Listing 9
Standard_Dev := 0.0 // Used to return the result on function return
I := 0
For ( I < 256 ) {
    Standard_Dev := Standard_Dev + ( ( Array[ I ] - Mean ) * ( Array[ I ] - Mean ) )
    I := I + 1
}
Standard_Dev := Square_Root ( Standard_Dev / 256 )

Listing 10: Pseudocode of the sequential function to calculate an array's standard deviation

A := An array of 256 floats
B := An array of 256 floats
Mul := 0.0
D_A := 0.0
D_B := 0.0
Cosine_Sim := 0.0 // Used to return the result on function return
I := 0
For ( I < 256 ) {
    Mul := Mul + A[ I ] * B[ I ]
    D_A := D_A + A[ I ] * A[ I ]
    D_B := D_B + B[ I ] * B[ I ]
    I := I + 1
}
Cosine_Sim := Mul / ( Square_Root ( D_A ) * Square_Root ( D_B ) )

Listing 11: Pseudocode of the sequential function to calculate the cosine similarity of two arrays

The parallel implementations used to compute the same measures are described in pseudocode in Listing 12 and Listing 13. As the arithmetic mean and standard deviation are calculated using the normalized distributions, they were added as another step to the normalization kernel. Cosine similarity is a measure between two sectors which cannot be precomputed in this way and is calculated in a separate kernel.

Normalized_Histogram := Array of 256-bin histograms, one for each sector
Sector_Structures := Array of structures, one for each sector
Shared_Reduce := Array of 256 integers in shared block memory
Shared_Mean := Array of 256 floats in shared block memory
Shared_Average := A single float in shared block memory
I := The thread's global index
T := The thread's block index
B := The block index
Stride := 128
Shared_Mean[ T ] = Normalized_Histogram[ I ] // Load values into shared memory
For ( Stride > 0 ) {
    If ( T < Stride )
        Shared_Mean[ T ] := Shared_Mean[ T ] + Shared_Mean[ T + Stride ]
    Stride = Stride / 2
}
If ( T == 0 ) { // Only one thread needs to load the variable
    Shared_Average = Shared_Mean[ T ] / 256
    Sector_Structures[ B ].Average = Shared_Average
}
Shared_Mean[ T ] = Normalized_Histogram[ I ] // Reload values into shared memory
Shared_Mean[ T ] = ( Shared_Mean[ T ] - Shared_Average ) * ( Shared_Mean[ T ] - Shared_Average )
Stride := 128
For ( Stride > 0 ) {
    If ( T < Stride )
        Shared_Mean[ T ] = Shared_Mean[ T ] + Shared_Mean[ T + Stride ]
    Stride = Stride / 2
}
If ( T == 0 ) // Only one thread needs to load the variable
    Sector_Structures[ B ].Standard_Deviation = Square_Root ( Shared_Mean[ T ] / 256 )

Listing 12: Pseudocode for parallel calculation of sector standard deviation and mean

Cosine_Similarity := A float in global memory to store the result
A := Array of 256 floats in global memory
B := Array of 256 floats in global memory
Shared_Mul := Array of 256 floats
Shared_A := Array of 256 floats
Shared_B := Array of 256 floats
Shared_DA := Array of 256 floats
Shared_DB := Array of 256 floats
I := The thread's global index
T := The thread's block index
Stride := 128
Shared_A[ T ] := A[ T ]
Shared_B[ T ] := B[ T ]
Shared_Mul[ T ] := Shared_A[ T ] * Shared_B[ T ]
Shared_DA[ T ] := Shared_A[ T ] * Shared_A[ T ]
Shared_DB[ T ] := Shared_B[ T ] * Shared_B[ T ]
For ( Stride > 0 ) {
    If ( T < Stride ) {
        Shared_Mul[ T ] = Shared_Mul[ T ] + Shared_Mul[ T + Stride ]
        Shared_DA[ T ] = Shared_DA[ T ] + Shared_DA[ T + Stride ]
        Shared_DB[ T ] = Shared_DB[ T ] + Shared_DB[ T + Stride ]
    }
    Stride = Stride / 2
}
If ( T == 0 ) { // Only one thread needs to load the result variable
    Cosine_Similarity = Shared_Mul[ T ] / ( Square_Root ( Shared_DA[ T ] ) * Square_Root ( Shared_DB[ T ] ) )
}

Listing 13: Pseudocode for parallel calculation of cosine similarity between two arrays

To test these functions a new experiment was devised. In this experiment the timing starts after the search image is loaded. The search function is used to find the occurrence of any headers, storing the offset of the first byte of each header found. The search image then has all sector histograms and normalized sector histograms generated and stored to memory. Then, for each sector, the arithmetic mean and standard deviation of the sector's normalized distribution are calculated. These values are stored to the custom structure datatype, where one structure holds the values for a single sector. Then, starting with the sector where a header was found, the steps described in Listing 14 are performed. For simplicity the same control loop described in Listing 14 was used for both the sequential and the GPU implementation, calling the appropriate version of the cosine similarity function where indicated. Even though some of the measures calculated for each sector are not used later in the process, they were still calculated so that the implementations could be compared given a large set of tasks to accomplish.

Results := A list to store offsets
While ( There are more header locations in the list ) {
    Sector := The normalized distribution of the values at that sector location
    Next_Sector := The next sector's normalized distribution of values
    Frag := 0
    While ( Not Frag ) {
        Judgement := Cosine_Similarity ( Sector, Next_Sector )
        If ( Judgement < 0.25 )
            Frag := 1
        Else
            Go to the next sector
    }
    Results := The index of the sector that failed the judgement test
}

Listing 14: Pseudocode for fragmentation detection

This process starts at the sector where the header was found and compares it to the next sector using cosine similarity. When the cosine similarity of two consecutive sectors drops below a certain threshold, the index is saved. The sector at that index likely represents one of two possible situations: the end of the file or a file fragmentation point.
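For reference, a CUDA rendering of the cosine similarity kernel in Listing 13 might look like this sketch, again with explicit barriers added and illustrative names.

// One 256-thread block compares two normalized sector distributions.
__global__ void gpuCosineSimilarity(const float *a, const float *b,
                                    float *result) {
    __shared__ float mul[256], da[256], db[256];
    int t = threadIdx.x;
    mul[t] = a[t] * b[t];      // elementwise product term
    da[t]  = a[t] * a[t];      // squared magnitude term of a
    db[t]  = b[t] * b[t];      // squared magnitude term of b
    __syncthreads();
    for (int stride = 128; stride > 0; stride /= 2) {
        if (t < stride) {
            mul[t] += mul[t + stride];
            da[t]  += da[t + stride];
            db[t]  += db[t + stride];
        }
        __syncthreads();
    }
    if (t == 0)
        *result = mul[0] / (sqrtf(da[0]) * sqrtf(db[0]));
}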

Even this simple use of the cosine similarity measure correctly identified the fragmentation point of the JPEG file at a threshold of 0.25 in tests on L0_Graphic.dd and L3_Graphic.dd. The same experiment was conducted using another correlation measure, Pearson's Correlation Coefficient (PCC), instead of cosine similarity. PCC as the judgement value was shown to be less effective, as it did not correctly identify the sector index of JPEG fragmentation or file end. As PCC was not to be used, it will not be described here. Table 4 shows the relative speedup of the GPU implementation over the sequential implementation.

Implementation    Average Runtime (s)    Speedup
Seq_imp           -                      1X
Gpu_imp           -                      12X

Table 4: Average implementation runtime using L0_Graphic.dd sample

Performance analysis of this section shows interesting results. Despite the GPU implementation achieving only a 12X speedup over the sequential implementation, less than 2% of its runtime is spent on kernel computation. The cosine similarity kernel was executed 119 times during this experiment and accounted for only 0.04% of the total runtime. Nearly all of the observed GPU overhead was seen in memory allocation and copying, at 27.86% of the total runtime. This indicates that fragmentation detection could achieve a much higher speedup if fully implemented on the GPU, avoiding the costly CPU management of the cosine similarity algorithm described in Listing 14.

VII. COMPARISON WITH RELATED WORK

Consistent with previous research, this project successfully demonstrated digital forensics functions executing on a GPU and shows that GPU acceleration is a promising way to address the challenges in the field [15][4]. This project goes a step further, as previous researchers focused on the various forensics applications of pattern searching using a GPU [15][4]. Here, modern file carving strategies were evaluated and functions common to many of the strategies were chosen. By implementing these functions twice, sequentially and as GPU kernels, comparisons were made showing that acceleration of a given function is possible.

VIII. CONCLUSION

This project shows that functions used in digital forensics file carving can achieve improved runtimes by executing on massively parallel hardware, and it further shows some of the difficulties encountered when doing so. The functions include pattern search, histogram, normalization, and statistics generation, all of which benefited from executing on a GPU. During experimentation, speedups from 11X to 82X were observed in GPU implementations compared to their sequential CPU counterparts. Problems were encountered and identified which, if addressed, could lead to even greater performance gains.

The translation of a sequential algorithm to a parallel algorithm must take into consideration the unique characteristics of the hardware being used. The algorithms developed for this project used similar strategies for both sequential and parallel functions so that comparisons could be made. In the GPU implementations this led to massive resource contention and processor occupancy problems. In general, one aim of an optimal GPU kernel is to achieve high processor occupancy, but not at the cost of having many thousands of blocks waiting for hardware space to execute on.

The functions implemented in this project, such as Gpu_search and Gpu_histo, show low arithmetic intensity. This can be attributed to the design in which each byte of the search image data is processed by a unique thread. For these functions the clear majority of threads complete a single critical operation; in Gpu_histo, for example, each thread conducts a single add operation. Considering the amount of overhead in creating and managing each thread, this design is extremely wasteful. Using two-dimensional kernels and more advanced designs such as tiling, so that each thread could process multiple bytes of data, would lead to higher arithmetic intensity and better performance.
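As a hedged sketch of that suggestion, a grid-stride loop lets each thread process many bytes with a small, occupancy-friendly grid. This is illustrative, not the project's code.

// Grid-stride histogram: a fixed-size grid sweeps the whole image, so each
// thread performs many additions instead of one.
__global__ void gpuHistoStrided(const unsigned char *data, long dataLength,
                                int *histograms) {
    long stride = (long)gridDim.x * blockDim.x;     // total threads in the grid
    for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
         i < dataLength; i += stride) {
        long sector = i / 512;                      // which sector this byte is in
        atomicAdd(&histograms[sector * 256 + data[i]], 1);
    }
}

// Hypothetical launch with a fixed grid, e.g. gpuHistoStrided<<<40, 1024>>>(...),
// matching the forty blocks needed for full occupancy on the GP104.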

IX. FUTURE WORK

This project shows the potential benefits of parallel computing in digital forensics. As the Heterogeneous System Architecture (HSA) becomes more popular and APIs such as OpenCL improve, it will become even easier to develop digital forensics applications that utilize not only discrete GPUs but also multicore processors and processors integrated with GPU hardware, all at the same time. In such an implementation it is easy to imagine the benefits of having a core application running parallel management tasks. The core process could manage the various kernel launches and other data processing needed while the kernels process enormous amounts of data asynchronously. A similar structure is already present behind the Sleuthkit interface. Sleuthkit is a popular open source digital forensics framework in which a main process communicates with other processes through a message passing interface (MPI), itself already a common strategy in parallel programming. The main process conducts many database operations as the image is processed, and there is already research showing that GPU acceleration of database operations is possible [9]. There are certainly many ways digital forensics applications such as Sleuthkit can be further improved using parallel processing.

As for the individual functions presented in this project, each could be greatly improved through better algorithm design and by following CUDA programming best practices. For example, privatization is one technique which could be used to reduce the resource contention present in Gpu_histo. Using privatization, each block of threads is allocated a private block of shared memory to operate on; once computation is complete, the results are merged back into the global memory allocation (a sketch follows this section). Furthermore, using more advanced algorithms, such as the Boyer-Moore pattern search described earlier, in place of brute-force strategies would likely reduce GPU overhead by avoiding unnecessary work.
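A hedged sketch of privatization applied to the histogram follows. It assumes one 512-thread block per 512-byte sector, and the names are illustrative rather than the project's.

// Privatized histogram: each block accumulates into a private shared-memory
// histogram, then merges its partial counts into global memory, so atomic
// contention stays within the block.
__global__ void gpuHistoPrivate(const unsigned char *data, long dataLength,
                                int *histograms) {
    __shared__ int localBins[256];                 // this block's private histogram
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        localBins[k] = 0;                          // clear the private bins
    __syncthreads();
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < dataLength)
        atomicAdd(&localBins[data[i]], 1);         // block-local atomic add
    __syncthreads();
    long sector = ((long)blockIdx.x * blockDim.x) / 512;  // one block per sector
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        atomicAdd(&histograms[sector * 256 + k], localBins[k]);  // merge to global
}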

APPENDIX A: Hardware Configuration

CPU: AMD A K
  Core Frequency (GHz):
  Cores: 4
  Memory Controller Clock (MHz): 2133
  Memory Channels: 2
  L2 Cache (# x MB): 2 x 2
  Integrated GPU: AMD Radeon R7

GPU: MSI GeForce GTX 1080 SEA HAWK X
  Memory (MB): 8192
  Memory Interface (b): 256
  Memory Speed (Gbps): 10
  Memory Bandwidth (GB/sec): 320
  Memory Clock (MHz):
  Core Frequency (MHz): 1607/1683/1708
  CUDA Cores: 2560

System Memory: G.SKILL TridentX
  Memory (# x GB): 2 x 8
  Memory Clock (MHz): 2133
  DRAM Interface: DDR3

Mainboard: ASRock FM2A88X-ITX+ (FM2+)

REFERENCES

1. M. C. Amirani, M. Toorani, and A. Beheshti. (2008). A new approach to content-based file type detection. Presented at the 2008 IEEE Symposium on Computers and Communications. [Online].

2. N. Beebe. (2009, January). Digital forensic research: The good, the bad and the unaddressed. Presented at the IFIP International Conference on Digital Forensics. [Online].

3. W. Calhoun and D. Coles. (2008, August). Predicting the Types of File Fragments. Presented at the Digital Forensic Research Conference. [Online].

4. C. H. Chen and F. Wu. (2012, January). An Efficient Acceleration of Digital Forensics Search Using GPGPU. Presented at the International Conference on Security and Management (SAM). [Online].

5. M. Cohen. (2007, December). Advanced Carving Techniques. Digital Investigation. [Online]. 4 (3).

6. S. Collange, Y. S. Dandass, M. Daumas, and D. Defour. (2009, January). Using graphics processors for parallelizing hash-based data carving. Presented at the 42nd Hawaii International Conference on System Sciences. [Online].


More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Predicting the Types of File Fragments

Predicting the Types of File Fragments Predicting the Types of File Fragments William C. Calhoun and Drue Coles Department of Mathematics, Computer Science and Statistics Bloomsburg, University of Pennsylvania Bloomsburg, PA 17815 Thanks to

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. File-System Structure File structure Logical storage unit Collection of related information File

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

Administrivia. Minute Essay From 4/11

Administrivia. Minute Essay From 4/11 Administrivia All homeworks graded. If you missed one, I m willing to accept it for partial credit (provided of course that you haven t looked at a sample solution!) through next Wednesday. I will grade

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs Where is the Data? Why you Cannot Debate CPU vs. GPU Performance Without the Answer Chris Gregg and Kim Hazelwood University of Virginia Computer Engineering g Labs 1 GPUs and Data Transfer GPU computing

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9 General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment

Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment Protect Your Software Investment Design Better. Reduce Risks. Ease Upgrades. Protect Your Software Investment The Difficulty with Embedded Software Development Developing embedded software is complicated.

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Model-Driven Engineering in Digital Forensics. Jeroen van den Bos with Tijs van der Storm and Leon Aronson

Model-Driven Engineering in Digital Forensics. Jeroen van den Bos with Tijs van der Storm and Leon Aronson Model-Driven Engineering in Digital Forensics Jeroen van den Bos (jeroen@infuse.org) with Tijs van der Storm and Leon Aronson Contents Digital forensics MDE in forensics Domain-specific optimizations Conclusion

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio Project Proposal ECE 526 Spring 2006 Modified Data Structure of Aho-Corasick Benfano Soewito, Ed Flanigan and John Pangrazio 1. Introduction The internet becomes the most important tool in this decade

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

Kernel level AES Acceleration using GPUs

Kernel level AES Acceleration using GPUs Kernel level AES Acceleration using GPUs TABLE OF CONTENTS 1 PROBLEM DEFINITION 1 2 MOTIVATIONS.................................................1 3 OBJECTIVE.....................................................2

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Verification and Validation of X-Sim: A Trace-Based Simulator

Verification and Validation of X-Sim: A Trace-Based Simulator http://www.cse.wustl.edu/~jain/cse567-06/ftp/xsim/index.html 1 of 11 Verification and Validation of X-Sim: A Trace-Based Simulator Saurabh Gayen, sg3@wustl.edu Abstract X-Sim is a trace-based simulator

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

Speed Up Your Codes Using GPU

Speed Up Your Codes Using GPU Speed Up Your Codes Using GPU Wu Di and Yeo Khoon Seng (Department of Mechanical Engineering) The use of Graphics Processing Units (GPU) for rendering is well known, but their power for general parallel

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018 Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Advanced Topics in Computer Architecture

Advanced Topics in Computer Architecture Advanced Topics in Computer Architecture Lecture 7 Data Level Parallelism: Vector Processors Marenglen Biba Department of Computer Science University of New York Tirana Cray I m certainly not inventing

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Chapter 11: Implementing File

Chapter 11: Implementing File Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

COMPUTER SCIENCE 4500 OPERATING SYSTEMS

COMPUTER SCIENCE 4500 OPERATING SYSTEMS Last update: 3/28/2017 COMPUTER SCIENCE 4500 OPERATING SYSTEMS 2017 Stanley Wileman Module 9: Memory Management Part 1 In This Module 2! Memory management functions! Types of memory and typical uses! Simple

More information

Accelerating String Matching Using Multi-threaded Algorithm

Accelerating String Matching Using Multi-threaded Algorithm Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Running SNAP. The SNAP Team February 2012

Running SNAP. The SNAP Team February 2012 Running SNAP The SNAP Team February 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi)

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi) A. Summary - In the area of Evaluation and Exploration of Next Generation Systems for Applicability and Performance, over the period of 01/01/11 through 03/31/11 the NCSA Innovative Systems Lab team investigated

More information

Power Estimation of UVA CS754 CMP Architecture

Power Estimation of UVA CS754 CMP Architecture Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As

More information