Identification, Recovery and Verification of MP3 frames from Obfuscated Files. Paul Flack


Identification, Recovery and Verification of MP3 frames from Obfuscated Files Paul Flack - 40135925 Submitted in partial fulfilment of the

Identification, Recovery and Verification of MP3 frames from Obfuscated Files

Paul Flack

Submitted in partial fulfilment of the requirements of Edinburgh Napier University for the Degree of Bachelor of Engineering (Honours) in Computer Security and Forensics

School of Computing

April 2017

Authorship Declaration

I, Paul Flack, confirm that this dissertation and the work presented in it are my own achievement.

Where I have consulted the published work of others this is always clearly attributed;

Where I have quoted from the work of others the source is always given. With the exception of such quotations this dissertation is entirely my own work;

I have acknowledged all main sources of help;

If my research follows on from previous work or is part of a larger collaborative research project, I have made clear exactly what was done by others and what I have contributed myself;

I have read and understand the penalties associated with Academic Misconduct.

I also confirm that I have obtained informed consent from all people I have involved in the work in this dissertation following the School's ethical guidelines.

Signed:
Date:
Matriculation no:

Data Protection Declaration

Under the 1998 Data Protection Act, the University cannot disclose your grade to an unauthorised person. However, other students benefit from studying dissertations that have their grades attached.

Please sign your name below one of the options to state your preference.

The University may make this dissertation, with indicative grade, available to others.

The University may make this dissertation available to others, but the grade may not be disclosed.

The University may not make this dissertation available to others.

Abstract

Much of the audio piracy industry revolves around the MP3 music format, and these files can often prove to be crucial evidence in investigations. File obfuscation is a common anti-forensic technique that allows criminals to hide potential evidence of piracy or other crimes from investigators by making it appear innocuous. At present, no tools exist to allow investigators to identify these files.

The aim of this project was to develop and implement a methodology that allows obfuscated MP3 files to be identified, no matter which file type they have been obfuscated to appear as, and the audio data from these files to then be recovered as playable media files. Expanding this methodology, the project automated the identification and recovery process by creating a tool that reads the internal file format to find MP3 frames before carving them to a recovery file. To assist investigators, a process was also needed for verifying the original MP3 to which the recovered frames had belonged, to show that the file had been copied from elsewhere and that it was a copyrighted file. To do this, Context Triggered Piecewise Hashing was implemented to compare the recovered files against a database of hashes of known MP3 files and to find matches for any recovered files. This automated recovery process was then rigorously tested against individual files of varying size and length, along with 5 USB images with different forms of obfuscation on the MP3 files present.

In conclusion, both the initial manual methodology and the automated process were able to identify MP3 frames despite obfuscation and recover these frames as a playable media file. However, the large amount of time needed to positively identify each frame header and calculate the length of each frame manually would be impossible to sustain in a real investigation, at which point the speed of the automated process becomes a great boon to investigators. Context Triggered Piecewise Hashing was also capable of identifying the recovered files that matched in the database of known files with high consistency rates, though it was ultimately able to name an MP3 but not prove breach of copyright.

Contents

1 INTRODUCTION
1.1 Digital Piracy and Forensic Investigations
1.2 Digital Music and Audio
1.3 Anti-forensics through Obfuscation
1.4 Aim
1.5 Objectives
1.6 Sources of information
1.7 Research Questions
1.8 Chapter Outlines
2 LITERATURE REVIEW
2.1 File Obfuscation
2.2 Principles of file carving
2.3 Cryptographic Hashing and Context Triggered Piecewise Hashing
2.4 Format of an MP3
3 METHODOLOGY
3.1 Further understanding the MP3 format
3.2 A process for finding frames of an MP3 File
4 AUTOMATING AND IMPLEMENTING THE RECOVERY PROCESS
5 DEFINING A TESTING PROCEDURE
5.1 Developing Test Images
5.2 Choosing Testing Metrics
6 RESULTS
6.1 Testing the Manual Recovery Process
6.2 Recovering using the automated process
7 EVALUATION
7.1 The Process
7.2 Impact on forensic investigations
7.3 Issues discovered
7.4 Personal Development
8 CONCLUSIONS
8.1 Future work
REFERENCES
APPENDICES
Appendix 1 Project Overview
Appendix 2 Second Formal Review Output
Appendix 3 Project Management and Diary Sheets
Appendix 4 Test Results
Appendix 5 Identification and Recovery Code

List of Tables

Table 1 - Version ID
Table 2 - Layer Description
Table 3 - Bit Rate Index (all values in kbps)
Table 4 - Sampling Rate (all values in Hz)
Table 5 - Defining Audio Channel Mode
Table 6 - Mode Extension for Joint Stereo Audio
Table 7 - Dolby Audio Emphasis
Table 8 - Frame header possibilities: First 16 bits
Table 9 - Breakdown of first frame of MP3Test.mp
Table 10 - Defining the Test Image
Table 11 - Manual Test recovery MP3Test1.jpg
Table 12 - Automated Test
Table 13 - Automated Test
Table 14 - Automated Test
Table 15 - Automated Test
Table 16 - Results of Image 1 recovery
Table 17 - Results of Image 4 recovery
Table 18 - Results of Image 2 recovery
Table 19 - Results of Image 3 recovery
Table 20 - Results of Image 5 recovery
Table 21 - Summary of automated test process

List of Figures

Figure 1 - MP3 Frame Layout (Aziz, 2012)
Figure 2 - MP3 Frame Structure (id3.org, 2012)
Figure 3 - First frame of MP3Test.mp
Figure 4 - Version Identification Definitions
Figure 5 - Version Identifier Code
Figure 6 - First and last indexed locations
Figure 7 - Identify frame header and calculate Frame Length
Figure 8 - Error check to ensure correct Frame Location
Figure 9 - Fuzzy Hash and Comparison code
Figure 10 - Initial user interaction
Figure 11 - File Mode: User defined file and recovery storage
Figure 12 - Image Mode: User defined image and recovery storage

Acknowledgements

I would like to thank my supervisor, Dr Petra Leimich, whose guidance and support have not only helped me through this project, and without whose advice I would have never reached the end, but who also allowed me the opportunity to become interested in the subject matter in the first place. I'd also like to thank my parents and in-laws for the support they've given over the last couple of years. Finally, I would also like to thank my wife Nicola, whose constant support in the last few years has been wonderful, helping me to have the opportunity to do this degree program.

1 Introduction

1.1 Digital Piracy and Forensic Investigations

Every year the number of online devices increases exponentially, and in recent years piracy of digital media has also become more and more common. With the rise of torrent sites offering hundreds, if not thousands, of digital media files for people to download, digital piracy in general has grown rapidly. Alongside this increase in piracy, the number of digital investigations attempting to prove that suspects have pirated digital media has also increased. This is a long, arduous task, with digital investigations sometimes stretching cases out before charges can be levied against suspects. Many techniques are already being developed to ease this workload, from advanced file carving techniques through to newer processes that attempt to perform triage on disks and disk images. With storage for this digital media growing larger all the time, these investigations will only grow longer if such techniques are not developed.

Alongside investigating digital piracy, digital forensic investigations into other crimes can often rely on the discovery and recovery of digital audio recordings of further activities, such as recorded phone conversations or voicemail messages (Primeau, 2010).

1.2 Digital Music and Audio

Much of digital piracy revolves around the music industry, with popular BitTorrent sites such as The Pirate Bay giving people easy access to thousands of albums daily. Although music is available in many formats, the most popular format for downloading it remains MP3. The popularity of this format can mainly be attributed to its generally low storage requirements, which allow users to build a large library of tracks using much less storage than other audio formats would require.
Due to this potential for large libraries of MP3 files, the ability to prove copyright infringement is still a key priority in digital forensic investigations, especially with the recent increase in charges brought against torrent site owners such as Artem Vaulin of KickAss Torrents (US DoJ, 2016). Finding the digital evidence of this is relatively easy when files are whole, even if deleted, but becomes progressively harder when looking at fragmented or obfuscated files. Although much research has been carried out into fragmented files, and many techniques to identify and recover file fragments have been developed from it, little work has been done on the identification or recovery of obfuscated files.

1.3 Anti-forensics through Obfuscation

Obfuscation is the act of taking a file and making it appear to be another file in order to counter standard digital forensic techniques. The easiest way of doing this is to change the file extension, but forensic investigation tools can easily detect that change. Instead, the more common method is to alter the file so that it appears to be an entirely different format. This technique is known as code obfuscation (Brand, 2006), and is a popular method used by criminals to hide many different files and file types, from copyrighted digital material to illicit material.

Because the user intends to hide this data from criminal investigations while still being able to use or view it themselves, the most common method of obfuscating files has been to change the magic numbers that make up the header and footer of a file to those of another file type. These numbers are collections of characters found at the beginning of all files, which allow operating systems to identify the type of every file. Basic tools such as the Linux file command or the sorter command (part of the popular forensic toolset The Sleuth Kit), and even more high-profile applications like Autopsy and EnCase, only see the obfuscated file type. They are unable to identify the original file type to an investigator without further, more in-depth investigation, which usually takes a lot more time and effort.
Because the tools cannot see these files as their original type, it is plausible that the evidence against the suspect would not be found. The criminals can easily change these headers back when they want access to a file, but investigators are unlikely to ever spot the obfuscation. Finding a method of identifying these file types during an investigation, based on the internal structure of the file, regardless of its file signature, would allow investigators to completely bypass file signature analysis and still obtain evidence against suspects.

1.4 Aim

The aim of this project is to develop a tool that applies an automated method of identifying the MP3 data within obfuscated files on a disk or disk image, carves these frames from the obfuscated files to recovery files, and uses fuzzy hashing to compare the recovered audio frames to a list of hashes of known MP3 files in order to identify the original file from which each obfuscated file was copied.

1.5 Objectives

To deliver on the aim of this project, there are several objectives which must be met:

Understanding the MP3 File Format, File Carving techniques and Fuzzy hashing

Before anything else, a full understanding of the MP3 file format will be required; the study of this format will aid in developing a procedure for the recovery of fragments. This will include understanding how individual frame headers are made up, what data and information they contain, and how this can be used to interrogate obfuscated files to identify file portions which look like they are from MP3s.

An understanding of common file carving techniques will also be required. This again will mean looking at how these methods are designed, and how files are identified and then carved. The aim of this project requires recovery of MP3 data structures from within obfuscated files, so an understanding of file carving is a basic requirement of this recovery method.

Finally, to correctly use fuzzy hashing techniques to identify potential matches for the originals of files, knowledge must be acquired of the principles of fuzzy hashing and of the existing tools that can perform this process.

Understanding these structures and techniques is essential for developing methods that can be used to identify, carve and subsequently verify the frames of MP3 files.

Develop a Methodology of Recovery

Manual Recovery

First, a methodology will be developed for identifying MP3 file frames manually and then, once identified, verifying and carving the MP3 frames and reconstructing them into a playable file.

Creation of Python scripts

Once the manual methodology has been developed, a selection of Python scripts will be created to automate the processes involved in the recovery of the MP3 media files. This is to give an examiner a more efficient method of obtaining the recovered data, with a smaller likelihood of errors. These scripts will provide an accompanying text file for any recovered files, with relevant information about the recovered files and their original counterpart, if identifiable.

Testing the process

The final objective will be the testing of these Python scripts. Various image files will be created with different numbers of MP3 files stored on them. These images will be used to test the recovery rate of the scripts, looking at the accuracy of the recovered data and checking the accompanying information to see which files are matched against known files.

1.6 Sources of information

For this dissertation, multiple sources were used to obtain relevant research materials. Primary among these were the digital research libraries available at ScienceDirect, IEEE Xplore and ResearchGate, which were used to assist in the discovery of pertinent research into the areas of the investigation.

Along with the use of these sites to obtain research and to provide credible sources, the investigation included the use of the research search engine Google Scholar; Semantic Scholar, a computer science research library provided by the Allen Institute for Artificial Intelligence; and a selection of blogs that gave background information useful in developing an understanding of the material. These blogs included both amateur authors and established authors such as Professor Gary Kessler.

1.7 Research Questions

The questions behind this project are:

Can the format of an obfuscated MP3 file be identified as an MP3?

Can the audio frames from them be recovered using relevant file carving techniques to produce a playable media file?

Can the recovered files be compared against a database of known MP3 files, to identify the original MP3s?

This dissertation is intended to answer these questions by providing a tool which can use a developed methodology to interrogate a selection of obfuscated files on a disk or disk image, find these file frames and attempt to recover them for playback. To do this, research will be carried out into the current techniques used in file carving, and an understanding of the MP3 file structure and of fuzzy hashing tools and processes will then be applied.

1.8 Chapter Outlines

Chapter 2 introduces the theory behind the paper, investigating the key concepts and issues involved with file obfuscation, before looking further into the standard, more traditional file carving techniques. After this, it looks at the concepts of cryptographic hashing and the more advanced context triggered piecewise hashing, before finally explaining the MP3 file format and how the internal frame structure of the format might be used to identify obfuscated files without relying on file signatures.

Chapter 3 presents a closer, more detailed understanding of the MP3 frame structure, expanding the explanation of the frame header and how it might be used to locate frames within the file. It describes a method for manually identifying and recovering MP3 files.

Chapter 4 develops the methodology into a tool for identifying the frames of individual files and recovering them, before describing how the method was expanded from investigating a single file to being applied to an entire USB disk image full of files. It also defines how context triggered piecewise hashing might be used to compare these recovered files against a database of known files to show an MP3 file that may have been the original source of the recovered frames.

Chapter 5 then defines a scenario for testing the process, ensuring the method and automated tool are able to identify and recover the MP3 frames of obfuscated files, while informing investigators when a file is not an MP3.

Chapter 6 outlines the results of the testing scenario, starting with manual identification and recovery of an MP3 file, moving through some testing of the automated recovery tool on individual files, and culminating in the use of the automated tool to recover multiple files from a USB image.

Chapter 7 evaluates the process and results, looking at how well the system performed against the testing metrics outlined for testing. It then assesses the usefulness of the overall method against actual forensic investigation and describes some observed issues with the process.

Chapter 8 concludes the project with an assessment of how the project aims have been met and how the research questions have been answered. It also makes recommendations for further work in the field.

2 Literature Review

2.1 File Obfuscation

Obfuscation, as the act of taking a file and making it appear to be another file, affords criminals an opportunity to hide different files and file types from digital forensic investigations. The intention of obfuscation is to allow the user to hide copyrighted or illicit material from criminal investigations, whilst still being able to use or view the data themselves. As such, the most common method of obfuscating files has been to change the magic numbers that make up the header and footer of a file to those of another file type (Sammes & Jenkinson, 2000). These numbers are collections of hex characters found at the beginning of all files, which let the operating system identify the type of every file. There are many lists of these on the internet and in books such as Forensic Computing: A Practitioner's Guide by Sammes & Jenkinson (2000), with one of the most exhaustive lists being an ongoing project of Dr Gary Kessler (2017).

Generally, in Windows operating systems all files have an extension as part of their filename, which Windows uses to tell what type a file is and to identify applications which can open or use that file. Although files with extensions will still be seen on other operating systems such as Linux, these operating systems do not identify the file using the extension as Windows would, but instead rely on the magic numbers that act as a header and footer for a file. For example, a JPEG file uses the combination 0xFFD8 as a header (with slightly different trailing bytes, dependent on the camera used to take the photo or the encoding method used to create the picture) and 0xFFD9 as a trailer, or footer, whereas a GIF file would use the combination 0x4749 for a header and the trailer 0x003B (Kessler, 2017).
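To make the signature check concrete, the following is a minimal Python sketch of identification by magic number, and of how prepending another format's header changes the answer. The signature values are the ones quoted above; the function name `guess_type`, the two-entry table and the placeholder payload are assumptions made purely for illustration and do not reproduce how any particular forensic tool works.

```python
# Header signatures as quoted in the text (Kessler, 2017).
# This table is deliberately tiny and is an illustrative assumption.
SIGNATURES = {
    "JPEG": bytes.fromhex("FFD8"),
    "GIF": bytes.fromhex("4749"),
}


def guess_type(data):
    """Return the first file type whose header signature starts the data."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


# Placeholder content standing in for the body of some other file.
payload = b"original file content"

# Obfuscation by prepending a JPEG header in front of the real content.
disguised = SIGNATURES["JPEG"] + payload

print(guess_type(payload))    # unknown
print(guess_type(disguised))  # JPEG
```

A check of this kind only ever inspects the first few bytes, which is why it reports whatever signature has been placed there rather than what the rest of the file contains.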
Because these tend to be short strings, usually representing 2 bytes but sometimes as many as 8 bytes, changing the magic numbers of one file type to match those of another or, alternatively, adding them in front of the existing file type numbers, is extremely easy to do, either manually or through code or bash scripts which can automate it for multiple files at once. Doing so will usually fool an operating system into confusing the file type with that of the new magic number combination.

Tools such as the Linux file command and The Sleuth Kit's sorter command are often used to identify file types. In these instances, however, these tools, and other more high-profile applications like Autopsy and EnCase, can only see the adjusted file type and are unable to identify the original file type to an investigator, essentially hiding the file without further, more in-depth investigation, which usually takes a lot more time and effort.

Because these files cannot be identified as their original file type, it is plausible that the evidence against the suspect would not be found. The criminals can easily change these headers back when they want access to a file, but investigators are unlikely to ever spot the obfuscation. Finding a method of identifying these file types during an investigation, based on the internal structure of the file, regardless of its file signature, would allow investigators to completely bypass file signature analysis and still obtain evidence against suspects (Sammes & Jenkinson, 2000).

2.2 Principles of file carving

Because obfuscated files cannot, therefore, be identified as their original file type, it may be possible to design a method of discovering and recovering frames of MP3 files by investigating advanced file carving techniques and the background behind them. File carving was defined by Christiaan Beek as:

File Carving or sometimes simply "carving", is the process of extracting a collection of data from a larger data set. (Beek, 2011).

This process is used in computer forensics to analyse unallocated space on volumes, allowing investigators to recover data which used to be stored on a drive (Pal & Memon, 2008).
A standard methodology for doing this is to use the header and footer values of files to delimit the data to be recovered. As an example, an investigator attempting to recover a JPEG image would use the hex values 0xFFD8 and 0xFFD9 to denote the beginning and end of a file in unallocated space. They would use these to mark where a file starts and ends, carving the data between these points and analysing the recovered data. This method works very well, provided that the unallocated space has not been reused for other files in the meantime. The reason it works so well is the way current file systems tend to mark space as available for use without actually deleting any of the existing data. File systems do this for the convenience of the user: marking the space as available frees the location without spending resources, which the user needs for other activities, on overwriting the data.

The tools currently available to do this have some limitations. The largest are that the metadata for a file must be present or, if it is missing, the parts of the file must be stored contiguously on a device, and that this method can introduce many false positives by including data within the range of bits being carved which is not actually part of the data the investigator requires (Pal & Memon, 2008). These tools are also unable to identify obfuscated MP3 files, as they rely on file headers and footers and do not look at the internal structure of the files being recovered.

Fragmented File Carving

Fragmentation of files makes it almost impossible to use the traditional methodology of file carving, as a file's individual fragments are unlikely to be contiguous and each fragment would not have its own file signature. File fragmentation is the idea that a single file has many parts that may be found at separate locations on a volume (Pahade, Singh, & Singh, 2015). In general, a file system will, whenever possible, attempt to store files in consecutive blocks on a volume. This provides efficient retrieval of the data in the file for the user's convenience. Sometimes though, usually due to a lack of available storage space on the device, this is not possible.
Instead, the file system will fragment the file, fitting the various parts of a file into the first available blocks it can find. To the user, this does not matter, and will usually not even be noticeable.
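The header and footer carving described earlier in this section, and the reason fragmentation undermines it, can be illustrated with a short Python sketch. The JPEG values are the ones quoted in the text; the `carve` function, the sample byte strings and the "OTHER-FILE-BLOCK" filler are hypothetical and only stand in for real unallocated space. This is a sketch of the general idea, not the behaviour of any specific carving tool.

```python
JPEG_HEADER = bytes.fromhex("FFD8")
JPEG_FOOTER = bytes.fromhex("FFD9")


def carve(data, header=JPEG_HEADER, footer=JPEG_FOOTER):
    """Return every span that runs from a header to the next footer."""
    carved = []
    start = data.find(header)
    while start != -1:
        end = data.find(footer, start + len(header))
        if end == -1:
            break  # header without a footer: nothing more to carve
        carved.append(data[start:end + len(footer)])
        start = data.find(header, end + len(footer))
    return carved


# A made-up "picture" whose body is split into two halves for the demo.
picture = JPEG_HEADER + b"first-half|second-half" + JPEG_FOOTER

# Contiguous storage: the whole file sits in one run of unallocated space.
contiguous = b"\x00" * 8 + picture + b"\x00" * 8

# Fragmented storage: a block from another file sits between the halves.
fragmented = (b"\x00" * 8 + JPEG_HEADER + b"first-half|"
              + b"OTHER-FILE-BLOCK" + b"second-half" + JPEG_FOOTER)

print(carve(contiguous) == [picture])   # True
print(carve(fragmented) == [picture])   # False
```

In the fragmented case the carved span still starts and ends with the expected signatures, but it also contains the foreign block, which is the kind of false positive noted by Pal & Memon (2008).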

This fragmentation may happen more than once on any file and could result in these fragments being spread in varying amounts throughout a hard drive (de Nijs, Biesheuvel, Denissen, & Lambert, 2006). Because file fragments are allocated to essentially random free space in this way, there may come an occasion where a file has been deleted, marking its space as unallocated, along with a large amount of space around the fragment also becoming available. In this instance, new data being written to the drive may overwrite this location, whilst other fragments of the same file may not get overwritten.

Applying this knowledge as it stands to obfuscated files is difficult; the method relies on knowing the file signature of the file that is being looked for, but in the case of obfuscated files, this may have been overwritten or deleted. This is especially important to realise when identifying MP3 files, which have a known file signature but do not need this signature in place to be playable, relying instead on the frame structure discussed later in this dissertation. Instead, more advanced fragment recovery techniques need to be discussed to be able to develop a method of recovery in these situations.

Smart Carving

SmartCarving, a more advanced file carving approach, was suggested by Rainer Poisel's group (Poisel, Tjoa, & Tavolato, 2011), in which a number of techniques are discussed. It starts with data pre-processing, wherein the data is parsed for known allocated chunks using existing tools such as The Sleuth Kit before being passed on to further investigation techniques. By identifying this known allocated data, the investigation is able to remove it from the process, so that the unallocated data is the only data passed on to the further stages of the investigation (Poisel et al., 2011).
Collation, the next stage of the process, is used to sort blocks into file types. This can again reduce the number of blocks being investigated in the next stage of the recovery process by eliminating the block types which cannot be worked on further. This file type sorting can use many of the techniques used for file carving to help identify the file types of blocks. These techniques are signature-based classification, feature-based classification and Normalised Compression Distance.

Signature-based classification is essentially the same as the standard file carving technique, using defined signatures from files which belong to specific file formats.

Feature-based classification looks for properties specific to individual file types. Roussev & Garfinkel (2009) describe this as searching for file-specific characteristics, which depends upon knowing what file types are being looked for. The paper goes on to explain that many encoding methods for certain file components can be the same across multiple file types, and that the task here is to find characteristics specific only to the file types being looked for. These unique file characteristics would be something akin to the FF00 signature found throughout the JPEG file structure, or the frame header signatures identified in Table 8, and would usually repeat throughout a file.

Normalised Compression Distance instead uses an algorithm which compares information entropy between reference data and the blocks being analysed, and then matches the blocks as being the same file type. This essentially provides the investigator with a method of seeing how similar blocks are and thus how likely they are to belong to the same file.

Once all these fragments have been identified, no matter the classification process, reassembly can be performed to attempt to give investigators a complete file. To do this correctly, an attempt is made to identify where the file fragmentation occurred, allowing the fragments to be reordered where necessary and possible before recreating the original file (Roussev & Garfinkel, 2009).
The problem with this method of reassembly is that it is unable to cope with missing, overwritten or corrupted fragments. It assumes the file is wholly available and this assumption is key to complete recovery of the file (Maguire, 2015).
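Of the classification techniques listed in this section, Normalised Compression Distance is the easiest to show in a few lines. The sketch below uses the usual definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as the compressor, the function names and the sample data are all assumptions made for illustration; nothing here is taken from the SmartCarving paper's own implementation.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed data (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalised Compression Distance: near 0 for similar data, near 1 for unrelated."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative blocks: a reference, a near-copy, and pseudo-random noise.
reference = b"the quick brown fox jumps over the lazy dog " * 50
similar = b"the quick brown fox jumps over the lazy cat " * 50
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(2000))

print("reference vs similar:", round(ncd(reference, similar), 3))
print("reference vs noise:  ", round(ncd(reference, noise), 3))
```

The block that resembles the reference scores a noticeably lower distance than the noise block, which is the property a collation stage would rely on when grouping blocks by likely file type.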

Fast object validation

Alternatively, a method called Fast Object Validation was proposed by Garfinkel (2007). In this method, much like in the feature-based classification method of SmartCarving, object validation is used to identify blocks of bytes representing specific file types where the header and footer of a file are not present in a fragment. Garfinkel (2007) states that it may be possible to identify a file by using the data structure of the file fragment to identify patterns within the fragment. This allows investigators to identify contiguous fragments of a file, but extra techniques are still needed to identify true fragments, those where the fragments are not contiguous.

In order to identify these non-contiguous fragments, Garfinkel describes a technique known as Bi-fragment gap carving (Garfinkel, 2007, p. 10) to identify bifragmented files. Garfinkel developed an algorithm to be used in situations where fragments are located at separate locations with sectors containing other data located between them. The algorithm places a gap between the different sectors, increasing the size of the gap until a valid file sequence is discovered. Unfortunately, the success of this relies on the two fragments containing a header and a footer respectively so that they can be identified. For more severe fragmentation this method is less successful, as the header and footer may have too much data in between them, or may not be in sequence, with the footer having been allocated to a location preceding the header.

Oscar

Another technique for carving fragmented files was first proposed in a paper by Karresand and Shahmehri (2006). This paper proposed a method of identifying fragments belonging to specific file types by their binary structure. This method, named Oscar by the authors, uses two algorithms to process binary data.
In the scenario put forth by Karresand, they used the process to specifically identify JPEG files, however, the process could be applied to other file types (Karresand & Shahmehri, 2006). The first algorithm used in Oscar is known as byte frequency distribution; in which they measured the number of times a byte appeared in a block, allowing them to identify a mean and the standard deviation of data. This allows identification of particular blocks which may contain the file type being looked for by the process. After this is done, a 12

search is performed, looking for filetype-specific sequences; for example, in JPEG files, specific markers are used consistently throughout the header and body. The second algorithm uses Rate of Change: it calculates the difference between the values of neighbouring bytes, allowing a frequency table to be calculated to help identify further bytes from a file type. In their study (Karresand & Shahmehri, 2006) this algorithm was used after the first to help identify fragments as being part of the same file once the initial fragments had been identified. Their technique did not rely upon the complete file being found, giving it enormous theoretical potential. Because each unallocated block of data is analysed individually, this process differs from the other techniques discussed in that at no point are the other components of the file needed. Unfortunately, the process is very resource intensive, taking a lot of time to process and recover even one file (Karresand & Shahmehri, 2006). The original paper was written with a military application in mind, where a lot of time can and should be dedicated to recovering data; in a forensic investigation there is often neither the time needed to apply these algorithms nor the manpower needed for manual analysis.

By adapting these fragmented file carving techniques, specifically the principles of fast object validation and SmartCarving's feature-based classification, it becomes possible to search for markers which can identify MP3 frames, such as those discussed in section 3.1. Looking for these markers within an obfuscated file should make it possible to identify MP3 files.
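The two Oscar measures described above can be illustrated with a short sketch. This is a simplified illustration rather than the authors' implementation: the function names and the squared-difference distance are assumptions made here, whereas the published method compares blocks against a model built from means and standard deviations.

```python
from collections import Counter


def byte_frequency(block: bytes) -> list:
    """Return a 256-entry list counting how often each byte value occurs."""
    counts = Counter(block)
    return [counts.get(value, 0) for value in range(256)]


def rate_of_change(block: bytes) -> list:
    """Return a 256-entry histogram of the absolute difference between
    each pair of neighbouring bytes."""
    histogram = [0] * 256
    for left, right in zip(block, block[1:]):
        histogram[abs(right - left)] += 1
    return histogram


def distance(observed: list, model_mean: list) -> float:
    """Hypothetical, simplified comparison: sum of squared differences
    between an observed histogram and a model's mean histogram."""
    return float(sum((o - m) ** 2 for o, m in zip(observed, model_mean)))
```

A block whose histograms lie close to those of a known file type would then be a candidate fragment of that type; the threshold used to decide "close" is left open here.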
2.3 Cryptographic Hashing and Context Triggered Piecewise Hashing

Cryptographic hashing algorithms such as MD5 or SHA-1 allow investigators to identify when a file matches another file, even if the operating system treats them as two different files because the name or extension has been changed. Investigators rely on one of the main principles of cryptographic hashes: changing even a single bit in a file gives a radically different hash output, and it is computationally unlikely that two different files, even ones differing by only a single bit, will produce the same cryptographic hash (Kornblum, 2006).
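The single-bit property can be seen with Python's standard hashlib module. The sketch below uses SHA-1 and an invented byte string standing in for file contents; both choices are assumptions made purely for illustration.

```python
import hashlib

# Invented stand-in for the contents of an evidence file.
original = b"example evidence file contents"

# Copy the data and flip the lowest bit of the first byte.
modified = bytearray(original)
modified[0] ^= 0x01

digest_original = hashlib.sha1(original).hexdigest()
digest_modified = hashlib.sha1(bytes(modified)).hexdigest()

# Hashing the same bytes again (for example after a rename) gives the
# same digest, because only the contents are hashed.
digest_repeat = hashlib.sha1(original).hexdigest()

print(digest_original)
print(digest_modified)
```

The first and third digests are identical, while the second differs even though only one bit of the input changed.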

Private businesses and larger organisations such as the American National Institute of Standards and Technology (NIST) have compiled, and provide access to, databases of cryptographic hashes which investigators can use to identify the files and software installed on a system (National Institute of Standards and Technology, n.d.). These libraries allow investigators to compare hashes of all the files on a system to those in the database, ruling out known good or normal applications and files from an investigation and identifying known bad or illicit materials straight away. This gives investigators an immediate path of investigation to follow, both to show that the known bad files are pertinent to their investigation and to investigate the unidentified files (Richard & Roussev, 2006).

The downside to these more traditional methods of cryptographic hashing is that if even a single bit of a file is changed, the cryptographic hash will be vastly different, and in the case of obfuscated files the modifications made will be relatively large compared to a single bit from the point of view of the hash algorithm. To combat this potential failure, piecewise hashing was developed to create multiple checksums for a file, instead of the standard single checksum. An early implementation of this was used to verify forensic imaging techniques and to mitigate errors in the imaging process; hashes would be made for sections of the input image of a specified size at set points. This allowed verification to be performed on these hashes, and if a small error had occurred in the process, only the sections affected, potentially a small cross section, would be invalidated whilst the integrity of the rest of the image was still assured. In this more traditional method, hashing programs use set offsets to determine where the hash algorithm is applied and the hash calculated.
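The fixed-offset piecewise hashing described above can be sketched as follows. The block size of 4096 bytes, the use of SHA-1 and the function names are assumptions chosen for illustration, not details taken from any particular imaging tool.

```python
import hashlib


def piecewise_hashes(data: bytes, block_size: int = 4096) -> list:
    """Hash each fixed-size section of the input separately."""
    return [
        hashlib.sha1(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(reference: list, candidate: list) -> list:
    """Return the indexes of the sections whose hashes differ."""
    return [
        index
        for index, (ref, cand) in enumerate(zip(reference, candidate))
        if ref != cand
    ]


# A 16 KiB stand-in "image" with one byte damaged in the second section.
image = bytes(range(256)) * 64
damaged = bytearray(image)
damaged[5000] ^= 0xFF

affected = changed_blocks(piecewise_hashes(image),
                          piecewise_hashes(bytes(damaged)))
```

Only the section containing the damaged byte is reported, so the remaining sections can still be treated as verified.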
To make this a more fluid process, a rolling hash algorithm was developed which produces a value based only on the current input. It takes in bytes and computes a hash based on the currently stored data, continuing to read more data, add it to the existing data and update the hash as it goes. Once a predetermined number of bytes has been read, the byte at the beginning of the stored data is removed for every new byte read in.
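The windowing behaviour described above can be shown with a deliberately simple rolling hash. The class below just sums the bytes currently in the window; it is an illustrative assumption and not the rolling hash used by spamsum or ssdeep.

```python
from collections import deque


class RollingHash:
    """Toy rolling hash: the sum of the last `window_size` bytes."""

    def __init__(self, window_size: int = 7):
        self.window = deque()
        self.window_size = window_size
        self.value = 0

    def update(self, byte: int) -> int:
        """Add one byte; once the window is full, drop the oldest byte."""
        self.window.append(byte)
        self.value += byte
        if len(self.window) > self.window_size:
            self.value -= self.window.popleft()
        return self.value


roller = RollingHash(window_size=3)
values = [roller.update(b) for b in b"\x01\x02\x03\x04"]
```

Because old bytes fall out of the window, the value depends only on the most recent bytes, which is what lets the same trigger points be found again after data has been inserted or removed earlier in a file.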

Kornblum defines a methodology for context triggered piecewise hashing in his article Identifying almost identical files using context triggered piecewise hashing (Kornblum, 2006), using an algorithm known as the spamsum algorithm, which he implemented in his program ssdeep. Context triggered piecewise hashing is essentially an amalgamation of this rolling hash algorithm with the more traditional piecewise hashing method. The process relies on the two hashing processes working concurrently: a rolling hash cycles through the data until a specified output, or trigger, is discovered, whilst a traditional hash is being computed over the same data. When the rolling hash hits one of these triggers, the traditional hash is stored as part of the signature of the context triggered hash and the traditional algorithm is reset, continuing with only whatever the current input is. This allows very small modifications to a file to be identified, as only small sections of the context triggered piecewise hash will vary from the hash of the original file, so the file remains identifiable as the same file with some modifications made to it (Kornblum, 2006).

2.4 Format of an MP3

MP3 is short for MPEG-1 Layer III or MPEG-2 Layer III, as designated by the Moving Picture Experts Group. The MP3 format is a method of compressing audio, with its popularity stemming from its ability to take large raw audio streams and compress them whilst keeping most of the audio quality. It was developed between 1988 and 1992, when it was certified by the International Standards Organisation (Grill & Quackenbush, 2005), as a way of utilising higher sampling rates than those offered by the existing Layer I and II encoders. The smaller compressed files soon became the popular go-to audio format, with their shared format giving rise to devices specifically designed for playback of the files: MP3 players.
The MP3 format is known as a lossy compression format, meaning that some data is discarded during compression so that the file size can be reduced. Often, with well-designed lossy compression, the end user won't be able to tell that data has been discarded (Benjamin, 2010). Benjamin states that the lossy compression process focusses on discarding portions of the digital data from the original file which are imperceptible to the human ear, for example, anything above

20kHz or below 20Hz, which mark the limits of an average person's hearing (Edström, 2008). In the same paper (Benjamin, 2010), the concept of auditory masking is discussed, explaining how the human ear's perception of one sound can be affected by the presence of other sounds in similar frequency bands. It is explained that lossy compression, as used by MP3, takes advantage of these facts when deciding how to compress data (Benjamin, 2010). Benjamin (2010) further explains that lossy compression algorithms aim to compress data to set bit rates, with variable bit rate (VBR) being a much more desirable form. With constant bit rate, every sample is reduced to a set length, so simple and complicated streams of audio are compressed to the same size, whereas with VBR the compression algorithm allows simpler parts of a stream, where little happens, to be compressed more than other, more complex parts. This means that the bits not used by the simpler samples can be used by the more complex samples, allowing less data loss there. Rassol Raissi (2002) expands on this by pointing out that, in the case of some decoders, VBR could actually cause timing issues on the audio track being played back.

MP3 Frame Structure

Rassol Raissi (2002) explains the format of the file to be that of contiguous frames, each containing up to five components, as shown in Figure 1.

Figure 1 - MP3 Frame Layout (Aziz, 2012)

First, there is the Frame Header, then, potentially, a Cyclical Redundancy Check, which is only present when indicated by the protection bit in the frame header. Next, there is the side information, which is needed to decode the main data, followed by the main audio data and finally an optional ancillary data section. Raissi (2002) also states that each frame stores 1152 audio samples and lasts for 26ms, equating to roughly 38 frames per second. Each frame is further divided into 2 granules, each containing 576 samples.
With the bitrate determining the size of a sample, it becomes clear that an increased bitrate will also increase the overall size of a frame.
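The 26ms and 38 frames per second figures quoted above follow from the number of samples per frame. The short calculation below assumes a sampling frequency of 44100Hz, which is not stated alongside those figures but is the rate used in later examples.

```python
SAMPLES_PER_FRAME = 1152
SAMPLE_RATE_HZ = 44100  # assumed sampling frequency

# Time covered by one frame, in milliseconds.
frame_duration_ms = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ * 1000

# Number of frames needed for one second of audio.
frames_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_FRAME

print(round(frame_duration_ms, 2), round(frames_per_second, 2))
```

At other sampling frequencies the same 1152 samples cover a different length of time, so the frame duration is tied to the sampling frequency rather than to the bitrate.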

Information gathered from Raissi's Theory Behind MP3 and a blog by Predrag Supurovic shows that the first part of every frame is the frame header, a 32-bit section which contains identifying components. These 32 bits have been documented by many sites and can be broken down into an eleven-bit sync header, followed by twenty-one bits that identify different attributes of the file (Raissi, 2002; Supurovic, 1998). The sync header allows a user to skip forwards or backwards through an audio file at will, whilst letting playback start from the point intended by the user. Each of the following twenty-one bits contributes to a value which, when interpreted, gives information about a specific attribute, with some attributes using groups of bits to keep track of important data.

The first eleven contiguous bits of every frame header, known as the Frame Sync, are always set to 1. After this first set of eleven bits, the next two bits identify the version of the file format, as shown in Table 1.

Bits 12 & 13 | Version ID
11 | Version 1 (MPEG-1)
10 | Version 2 (MPEG-2)
00 | Version 2.5 (MPEG-2.5*)
Table 1 Version ID

Technically, MP3 is MPEG-1 Layer III or MPEG-2 Layer III, which means that, generally, the first twelve bits will always be set, with only the thirteenth bit changing. It is essential, however, to be aware of the more modern, unofficial MPEG-2.5 format, as it may come up when looking in more depth at the make-up of the file structure. Version 2.5 is not supported by some applications, as it is an unofficial format, but it allows for the use of different sampling rates (Supurovic, 1998).

The following two bits of the frame header, bits fourteen and fifteen, define the layer used in the encoding process of the file, as shown in Table 2.
As mentioned, MP3 is defined as Layer III, yet it is similar to MP1 and MP2 files; as such, the frame header uses the two-bit combinations shown in Table 2 to distinguish the file from the other two formats.

Bits 14 & 15 | Layer
00 | Reserved
01 | Layer III
10 | Layer II
11 | Layer I
Table 2 Layer Description

The sixteenth bit in the frame header is the protection bit, which indicates whether a cyclical redundancy check is used. When a CRC is present, a sixteen-bit check value follows directly after the frame header. This CRC is used to check the sensitive data in the stream for transmission errors. Sensitive data is defined as bits in both the header and the side information. An error in this information will corrupt an entire frame, making that frame entirely unusable by the player, whereas an error in the main data (the audio stream of the frame) will only corrupt part of a frame.

The next four bits define the bitrate of the track. Each of the sixteen possible combinations of four bits has a corresponding bitrate in kbps, dependent on the version and layer identified above.

Bits | MPEG V1, Layer 1 | MPEG V1, Layer 2 | MPEG V1, Layer 3 | MPEG V2, Layer 1 | MPEG V2, Layer 2 & Layer 3
0000 | free | free | free | free | free

1111 | bad | bad | bad | bad | bad
Table 3 - Bit Rate Index (all values in kbps)

Note that in Table 3, "0000" is indicated as free, meaning it has no restriction on bitrate. On some occasions, one of the other bitrates may not be used. On these occasions, the application used for encoding must determine the bitrate. This is then only used for internal purposes by that application, as other third-party applications have no means to identify that bitrate. (Identifying these bitrates is not impossible, but is very difficult.) In contrast, "1111" should never be found in a frame header, as it is an unassigned, unusable value. As stated earlier, an MP3 is only ever Layer III, so only those bitrates which are relevant to that layer are important when investigating MP3 files. The other bitrates are for use in MPEG Layer I and MPEG Layer II files.

Much like bitrate, the sampling frequency is determined by an index. This index consists of bits twenty-one and twenty-two of the frame header. As a two-bit structure, it can only have four combinations. These four combinations give the sampling rate in hertz of the audio track, as shown in Table 4, with each MPEG version having a different value for each combination:

Bits 21 & 22 | MPEG v1 | MPEG v2 | MPEG v2.5
11 | reserved | reserved | reserved
Table 4 - Sampling Rate (all values in Hz)

The twenty-third bit is known as the padding bit and is used in some frames to satisfy the bitrate requirements. If the padding bit is set, the frame is padded with an extra byte, which becomes relevant for certain bitrate and sampling frequency combinations. For example, on an MPEG version 1 file, a bit rate of 128 Kbit/s with a sampling frequency of 44100Hz will create mostly 418-byte frames, with some of only 417 bytes. To fit the bitrate, some of these frames have to be 418 bytes long; in these frames, the padding bit will be found to be set in the frame header.
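The two index lookups described above can be expressed as small tables in code. The values below are those commonly published for MPEG audio frame headers and cover Layer III only; they are supplied here as an assumption and should be checked against Tables 3 and 4. The function names are hypothetical.

```python
# Bitrates in kbps for Layer III, indexed by the 4-bit bitrate index.
# None marks the "free" (0000) and "bad" (1111) entries.
BITRATE_KBPS_LAYER3 = {
    "MPEG-1": [None, 32, 40, 48, 56, 64, 80, 96,
               112, 128, 160, 192, 224, 256, 320, None],
    "MPEG-2": [None, 8, 16, 24, 32, 40, 48, 56,
               64, 80, 96, 112, 128, 144, 160, None],
}

# Sampling rates in Hz, indexed by the 2-bit sampling index (11 is reserved).
SAMPLE_RATE_HZ = {
    "MPEG-1": [44100, 48000, 32000, None],
    "MPEG-2": [22050, 24000, 16000, None],
    "MPEG-2.5": [11025, 12000, 8000, None],
}


def bitrate_bps(version: str, index_bits: str):
    """Look up the Layer III bitrate in bits per second.
    MPEG-2.5 is assumed to share the MPEG-2 bitrate column."""
    table = BITRATE_KBPS_LAYER3["MPEG-1" if version == "MPEG-1" else "MPEG-2"]
    kbps = table[int(index_bits, 2)]
    return None if kbps is None else kbps * 1000


def sample_rate_hz(version: str, index_bits: str):
    """Look up the sampling rate in Hz."""
    return SAMPLE_RATE_HZ[version][int(index_bits, 2)]
```

For example, `bitrate_bps("MPEG-1", "1001")` gives 128000 and `sample_rate_hz("MPEG-1", "00")` gives 44100, the combination used in the padding example above.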

The next bit, the twenty-fourth, is known as the private bit and is used by some applications. Generally, this bit is not set, but sometimes it may be set dependent on specific application usage. As it is used for application-specific triggers, it is only ever relevant to those applications.

The twenty-fifth and twenty-sixth bits define the channel mode of the frame and, as such, determine whether a track uses Joint stereo, Dual channel, Stereo or Mono audio channels, as defined in Table 5.

Bits 25 & 26 | Channel mode
00 | Stereo
01 | Joint Stereo
10 | Dual Channel
11 | Single Channel (mono)
Table 5 - Defining Audio Channel Mode

The next two bits are only used in joint stereo mode. If a track is identified as "Joint Stereo" in the channel mode bits, then these two bits are used to define a mode extension, as seen in Table 6; if the track is not joint stereo, these bits will be set to 0.

Bits 27 & 28 | Intensity stereo | Middle/Side (MS) stereo
00 | Off | Off
01 | On | Off
10 | Off | On
11 | On | On
Table 6 - Mode Extension for Joint Stereo Audio

The next bit defines whether a file is copyrighted or not. If set, the file is copyrighted; if not, it is not copyrighted. This is followed by a single bit defining whether this frame is located upon its original media. If it is set, this is the file's original media; if not, this is a copy. The final two bits are used in Dolby sound systems to establish if and how sound needs to be normalised after bass drops or suppressions. The relevant combinations and descriptions can be found in Table 7, but at present these are very rarely used.
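The field layout described above can be decoded mechanically by shifting and masking the 32-bit header. The sketch below returns the raw bit values only and leaves their interpretation to the tables; the field names, the function name and the example header bytes are assumptions made for illustration.

```python
# (field name, width in bits), in the order the fields appear in the header.
FIELDS = [
    ("sync", 11), ("version", 2), ("layer", 2), ("protection", 1),
    ("bitrate_index", 4), ("sampling_index", 2), ("padding", 1),
    ("private", 1), ("channel_mode", 2), ("mode_extension", 2),
    ("copyright", 1), ("original", 1), ("emphasis", 2),
]


def decode_header(header: bytes) -> dict:
    """Split a 4-byte frame header into its raw bit fields."""
    if len(header) != 4:
        raise ValueError("a frame header is exactly 4 bytes")
    value = int.from_bytes(header, "big")
    remaining = 32
    fields = {}
    for name, width in FIELDS:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return fields


# Hypothetical header bytes, not taken from any file discussed here.
example = decode_header(bytes.fromhex("fffb9064"))
```

For the hypothetical bytes FF FB 90 64 this yields a version field of 3 (binary 11), a layer field of 1 (binary 01), a bitrate index of 9 and a channel mode of 1 (binary 01, Joint Stereo).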

Bits 31 & 32 | Emphasis method
00 | none
01 | 50/15ms
10 | Reserved
11 | CCITT J.17
Table 7 - Dolby Audio Emphasis

Side Information

Side information, found inside every frame, is used to decode the main data of the frame and varies in size dependent upon the channel mode encoded for the file; if the file is single channel, this information will be 17 bytes long, otherwise it will be 32 bytes long (id3.org, 2012). A major part of the side information relates to the bit reservoir, a technique used in Layer III encoding. Sometimes when encoding there is leftover free space in a frame; this space can then be used by later, larger frames to store data. The side information records this with a 9-bit value showing where in previous frames the data can be found. It is a negative offset from the first byte of the Frame Sync and does not include static parts of other frames, such as their frame header (Raissi, 2002). The range over which data could be stored prior to this frame can be calculated as follows:

(2^9 - 1) × 8 = 4088 bits

Cyclical Redundancy Check

Sometimes a Cyclical Redundancy Check can also be found in a frame; this is used to ensure that bits of the header and side information are not corrupted. These bits are deemed sensitive data and, if corrupt, will mean that the frame cannot be decoded at all (Raissi, 2002).
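The sizes quoted above can be combined to work out where the main data begins inside a frame. The sketch below assumes a 4-byte header and a 16-bit (2-byte) CRC word when one is present; the function names are hypothetical.

```python
HEADER_BYTES = 4
CRC_BYTES = 2  # assumed: a 16-bit check word


def side_info_length(single_channel: bool) -> int:
    """Side information is 17 bytes for single channel, otherwise 32."""
    return 17 if single_channel else 32


def main_data_offset(has_crc: bool, single_channel: bool) -> int:
    """Offset of the main data from the start of the frame, in bytes."""
    crc = CRC_BYTES if has_crc else 0
    return HEADER_BYTES + crc + side_info_length(single_channel)


# Largest backwards reach of the 9-bit bit reservoir pointer.
max_reservoir_bytes = 2 ** 9 - 1
max_reservoir_bits = max_reservoir_bytes * 8
```

With these assumptions a stereo frame without a CRC has its main data starting 36 bytes into the frame, and the reservoir can reach back at most 511 bytes, or 4088 bits.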

3 Methodology

The aim of this project is the identification and recovery of MP3 frames from obfuscated files, using file carving techniques. Current techniques, such as Garfinkel's Fast object validation and Bi-Fragment Gap techniques (Garfinkel, 2007) or Poisel's SmartCarving (Poisel et al., 2011), both discussed in the literature review, are based on recovering fragmented files. By applying some of the concepts from these processes to obfuscated files, particularly the feature-based classifications of SmartCarving or the Oscar method (Karresand & Shahmehri, 2006), it may be possible to identify and carve the frames of MP3 files from them. The development of this new technique for recovering file frames may allow investigators to collect data which was unrecoverable in the past, by bypassing the file signature.

3.1 Further understanding the MP3 format

As with all other file carving techniques, to develop a methodology for recovering MP3 frames it is essential that a full understanding of the MP3 file structure is attained. The key aspect of the MP3 file structure is its frame format, which has been described in the literature review but is briefly restated here to help develop and explain this methodology.

Figure 2 shows the basic structure of an MP3 frame. As discussed in the literature review, there are five potential parts to a frame: the frame header, CRC, side information, the main audio data, and some ancillary data. Not all frames will have all five of these parts, as both the CRC and ancillary data are optional components, defined by the encoding process. In Figure 2, the frame header can be seen along with the audio data, which in this example is anything not in the frame header.

In Figure 2 (id3.org, 2012), the 32-bit frame header has been broken down into an 11-bit sync header, followed by 21 bits that identify different attributes of the file (Supurovic, 1998).
The Sync header allows a user to skip forwards or backwards through an audio file at will while allowing the file to begin playing again from the correct point intended by the user. These following bits each carry a value which as described

earlier can give information about their specific attribute, some of which use different selections of groupings to keep track of important data.

After the initial 11 bits, the next 2 bits give the version identity of the file; MP3 files can be MPEG Version 1, Version 2 or the unofficial Version 2.5. These are followed by another 2-bit sequence, at bits 14 and 15 of the frame header, which defines the layer used in the file encoding process. The different combinations describe the different layers available in the MPEG format, although in any MP3 they will be set to 01, which represents Layer III, the layer used by all MP3 files. After this, the next single bit, defined as the protection bit, tells the decoder whether there will be a cyclical redundancy check built into the frame. This helps ensure that there is no corruption in the frame when decoding is required. The ensuing 4 bits define the bitrate of the frame, with each potential combination having a corresponding bitrate dependent upon the file version and layer, followed by a 2-bit sequence for the sampling frequency and a single bit to denote the padding of a frame. After a single private bit, the remaining 8 bits are made up of a 2-bit channel mode sequence, followed by a 2-bit mode extension definition, a single bit defining copyright and a single bit to show if the file is on its original media, with a final 2-bit sequence defining the emphasis used by a joint stereo file.

Figure 2 - MP3 Frame Structure (id3.org, 2012)

3.1.1 Frame Length and Size

Once this frame header is identified, the individual frames of an MP3 can be found. These frames hold the main data of a track, and the frame structure is repeated continuously throughout the file. To be able to identify the individual frames of an MP3, it must be possible to calculate the length of each frame. Understanding the difference between frame length and frame size is important.

Frame size is the number of samples in each frame. This value is a constant:
o 384 samples for Layer I.
o 1152 samples for Layer II and Layer III.

Frame length defines the length of a frame when compressed and is calculated in slots. One slot is 4 bytes long for Layer I, and one byte long for Layer II and Layer III. If a file is a dual channel file, defined by bits 25 and 26 in the frame header, then the bitrate will be split evenly between the two channels, resulting in 576 samples per channel, per frame on a Layer III file (Raissi, 2002), and a frame always equates to 26ms of audio.

Calculating frame length

Knowing the bitrate and sampling frequency, identified from the bit combinations described above, it is possible to calculate the length of the frames in the file (Raissi, 2002).

For Layer I MPEG files use this formula:
o Frame Length in Bytes = (12 × Bitrate / Sampling Frequency + Padding) × 4

For Layer II & III MPEG files use this formula:
o Frame Length in Bytes = 144 × Bitrate / Sampling Frequency + Padding

Example:
o For an MP3 frame with a bitrate of 128Kbps, a sampling frequency of 44100Hz and no padding, 144 × 128000 / 44100 = 417.96, which is rounded down to give a frame length of 417 bytes (418 bytes when the padding bit is set).
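The two formulas above translate directly into a small function. This is a minimal sketch that assumes the bitrate is given in bits per second, that the result is rounded down with integer division, and that the Layer III coefficient of 144 applies as given in the text; the function name is hypothetical.

```python
def frame_length_bytes(bitrate_bps: int, sample_rate_hz: int,
                       padding: int, layer: int = 3) -> int:
    """Return the frame length in bytes for the given header values.

    padding is the value of the padding bit (0 or 1).
    """
    if layer == 1:
        # Layer I slots are 4 bytes long.
        return (12 * bitrate_bps // sample_rate_hz + padding) * 4
    # Layer II and Layer III slots are 1 byte long.
    return 144 * bitrate_bps // sample_rate_hz + padding


unpadded = frame_length_bytes(128000, 44100, padding=0)
padded = frame_length_bytes(128000, 44100, padding=1)
```

For 128Kbps at 44100Hz this gives 417 bytes without padding and 418 bytes with padding.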

3.2 A process for finding frames of an MP3 File

Having developed an understanding of how the MP3 structure is formed, an attempt needed to be made to apply this knowledge to a normal MP3. The best way to do this is to look at an MP3 file in a hex editor, searching for the information outlined above. To this end, the file MP3Test.mp3 was opened in HxD, a standard hex editor. The Linux tool xxd offers the option to instead do a binary dump of a file, which sounds potentially useful at first, as the first section of each frame is essentially a 12-bit sequence set to 1. Unfortunately, finding this 12-bit sequence in a binary string becomes unrealistic; reading the file in binary, it is quite possible to find twelve consecutive bits set to 1 which are not actually the start of a frame header, and it is even possible for some of those bits to come from the end of one byte and the rest from the next. As such, it is easier to look at the file in a hex editor and calculate the conversions to and from binary manually.

Given the information above, and assuming the file is not one of the rare version 2.5 files, it is relatively safe to say that the first twelve bits will be set to 1. If it is one of the version 2.5 files, the last bit of these twelve will be unset, or 0. Secondly, it is known that the next bit can be either a 1 or a 0 dependent upon version encoding. As was established previously, an MP3 will always be in Layer III format, so the next two bits will be 01. Finally, the protection bit has a fluctuating status. It might be set or not but, more importantly, it can vary from frame to frame; it is entirely possible that one frame is protected whilst no others are, or that the status changes every other frame. As such, it is necessary to calculate the combinations of bytes there could be for the first 16 bits, or 2 bytes, of the frame header.
This is best shown in Table 8, which shows the possible combinations of bits, and their hex equivalent.
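The six two-byte combinations can also be enumerated mechanically from the bit fields. In the sketch below the dictionary and function names are assumptions; the protection bit is reported as its raw value rather than interpreted.

```python
SYNC = "1" * 11
VERSION_BITS = {"MPEG-1": "11", "MPEG-2": "10", "MPEG-2.5": "00"}
LAYER_III = "01"


def header_prefixes() -> dict:
    """Map each possible first two header bytes (as hex) to the
    version name and raw protection bit that produce them."""
    prefixes = {}
    for version, version_bits in VERSION_BITS.items():
        for protection_bit in ("1", "0"):
            bits = SYNC + version_bits + LAYER_III + protection_bit
            prefixes[format(int(bits, 2), "04X")] = (version, protection_bit)
    return prefixes


candidates = header_prefixes()
```

The keys of `candidates` are FFFB, FFFA, FFF3, FFF2, FFE3 and FFE2, the same six hex values listed in Table 8.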

Variant | Sync (11 bits) | Version Identifier (2 bits) | Layer Description (2 bits) | Protection Bit (1 bit) | 16-Bit Binary | Hex Conversion
V1 Layer III, Protected | 11111111111 | 11 | 01 | 1 | 1111111111111011 | FFFB
V1 Layer III, Unprotected | 11111111111 | 11 | 01 | 0 | 1111111111111010 | FFFA
V2 Layer III, Protected | 11111111111 | 10 | 01 | 1 | 1111111111110011 | FFF3
V2 Layer III, Unprotected | 11111111111 | 10 | 01 | 0 | 1111111111110010 | FFF2
V2.5 Layer III, Protected | 11111111111 | 00 | 01 | 1 | 1111111111100011 | FFE3
V2.5 Layer III, Unprotected | 11111111111 | 00 | 01 | 0 | 1111111111100010 | FFE2
Table 8 - Frame header possibilities: First 16 bits

Now that the potential opening sections of frame headers have been identified, it is possible to find these in a file opened with a hex editor by searching manually for them and seeing which comes first, as this will usually identify the first frame header. Once one of these two-byte hex sequences has been found, the next two bytes in sequence can be taken and converted to binary to get the other 16 bits of the frame header. In the case of the MP3Test.mp3 file, the pattern was found to be FF FB followed by two further bytes, as shown in Figure 3.

Figure 3 - First frame of MP3Test.mp3

Once these bytes are converted to binary and cross-referenced with Table 8, it is seen that FF FB denotes an MPEG-1, Layer III frame; using Table 3 and Table 4, it is possible to identify the bitrate and sampling frequency of this frame.

Bits 17-20, in this case 0100, denote a bit rate of 56kbps or 56,000bps, and bits 21 & 22, identified as 00 for this frame, denote a sampling frequency of 44100Hz. With the padding bit not set on this frame, the calculation is 144 × 56000 / 44100 + 0 = 182.86, rounded down to 182, meaning that, as can be seen in Figure 3, this frame is 182 bytes long. Not only does this calculation give the frame length, but it also indicates the start of the next frame header, which comes immediately after this frame. Giving the same treatment to every frame header, it is then possible to find each subsequent frame header. The file shown in Table 9 was a variable bit rate file, chosen specifically as a test case because its frames vary in length, showing the necessity of being able to check every frame length individually. If it had not been possible to identify this first section as a frame header, it would have been necessary to continue to look for the next occurrence of one of the 2-byte hex values, calculating the frame length in the same manner until it corresponded to a location at which the frame continued contiguously into the next frame.

Attribute | Value
Version Encoding | MPEG V1
Layer | Layer III
Bit Rate | 56 Kbps
Sampling Frequency | 44100Hz
Padding | Off
Mode | Stereo
The remaining attributes (CRC protection, Private, Mode Extension, Copyright, Original and Emphasis) are reported as Off, None or No for this frame.
Table 9 - Breakdown of first frame of MP3Test.mp3

The first 23 bits of the frame header are the only bits in the frame header that add value for the purposes of identifying the file frames and can be used to discover the fragments of a file. The other bits may be of use in identifying fragments as belonging to the same file, as it is unlikely a file will switch its channel mode or its copyright identity mid-file. Yet given how commonly the same values are used across MP3 files, they are unlikely to help with recovery and identification of fragments.
This recovery technique is very similar to the SmartCarving technique (Poisel et al., 2011), using these similar features in what is essentially feature-based classification. Unlike most file carving tools and techniques, this methodology allows an opportunity to look at files which may not contain a file header, or in which the file header may be obfuscated, and there is no need to rely on finding this to show where the file begins. Thankfully, in MP3 files this isn't essential as the data stored

between the file header and the first frame is only metadata, and therefore not essential to the playback or decoding of the file, and if the file has been obfuscated, may not even contain anything relevant to the file. As described earlier, every MP3 frame equates to 26ms of audio, no matter the length of the frame in bytes, meaning that for any single second of MP3 audio there would be around 38 frames, with any single MP3 having hundreds or thousands of frames dependent on the length of the audio track. For a 3-minute audio track, there would be approximately 6840 frames found within that file. Whilst the manual methodology described could be used to identify and recover the frames of a file, with numbers of frames like this manual identification becomes unrealistic, because the time needed for a single person to sit and discover these frames would be too great. Because of this, code needs to be developed that allows an automated application to identify and recover frames before writing them to another file.

4 Automating and Implementing the Recovery Process

To develop this automated process, it was decided that the scripting language Python would be used. The reasons for this were two-fold: firstly, the language is very versatile, with relatively simple syntax able to perform complex actions, and secondly, it allows for cross-platform support very easily. Users need only have the same version of Python installed, with any requisite libraries, to be able to use the process on their operating system. The process was developed using Anaconda and Spyder on the Ubuntu operating system, but it was decided that its implementation needed to be possible on other operating systems so that it could be used in future by investigators without them needing to use the exact same setup.

Identifying file MPEG version

To start, before discovering the frames, the process must be able to identify the version of the MP3 file so that when it does later identify frames, it can calculate the bitrate and sampling frequency from pre-described values. To do this, the file being investigated needed to be read and searched for the frequency of the markers that define the different versions of MP3, as defined by Table 8. The possible versions of the MP3 format can be defined as Version 1, 2 or 2.5, further separated by whether they are protected or not.

    versions = {'fffa': 'version 1, unprotected',
                'fffb': 'version 1, protected',
                'fff2': 'version 2, unprotected',
                'fff3': 'version 2, protected',
                'ffe2': 'Version 2.5, unprotected',
                'ffe3': 'Version 2.5, protected'}

Figure 4 - Version Identification Definitions

Each MP3 may have any number of each of these identifying sequences found within it, but each frame has one of these identifiers in its frame header, meaning that whichever of these identifiers is the most common is likely to indicate the correct version for the file. A secondary aspect to be considered is the placement of these identifiers.
The first instance of one of these identifiers should also identify the file version, so it follows that if the most common version identifier is also the first in a file, then this is the correct version identifier for that MP3.

To define these aspects the following code is used:

    import operator
    import re

    try:
        for each in KnownVersionID:
            # Offsets of every occurrence of this identifier in the hex string
            offsets = [m.start() for m in re.finditer(each, FileToRead)]
            first.setdefault(each, min(offsets))
            count.setdefault(each, len(offsets))
        # Identifier that appears earliest, and identifier that appears most often
        earliest = min(first.items(), key=operator.itemgetter(1))[0]
        most_common = max(count.items(), key=operator.itemgetter(1))[0]
        if earliest == most_common and earliest in versions:
            return versions[earliest]
    except ValueError:
        # Assumed handling: min() raises ValueError when an identifier never occurs
        return None

Figure 5 - Version Identifier Code

Identifying frames

After identifying the version of the MP3, the next stage was to identify the location of frames within the file. The easiest method of doing this is to define the offsets at which the first and last occurrences of the frame header can be found.

    # Offsets of the first and last occurrence of the version identifier
    strt_frame = min([m.start() for m in re.finditer(version, FileToRead)])
    last_indx = max([m.start() for m in re.finditer(version, FileToRead)])

Figure 6 - First and last indexed locations

The frame count must be kept, not only to keep track of the number of frames recovered, but also to identify the frames in the correct order for the recovery process. Once these locations are defined, the file is read to the location of the first identified frame, whereupon, if the frame count at present is zero, the first frame's entire frame header is read and analysed for information using the values defined by the coding tables.
    # Get the frame header and analyse its bits to determine the bit rate,
    # sampling frequency and whether padding is set, then use these to
    # determine the frame length
    frameHeader = RecoveryData[firstFrame:(firstFrame+8)]
    frameHeaderBin = StringConversion(unhexlify(frameHeader))
    Bitrate = FHA.BitrateCalc(frameHeaderBin)
    SampleRate = FHA.SampleFreqCalc(frameHeaderBin)
    Padding = FHA.PaddingBit(frameHeaderBin)
    FrameLength = 144*Bitrate/SampleRate+Padding
    # Based on the frame length calculated above, return the entire first
    # frame of the MP3 audio stream
    frame = RecoveryData[firstFrame:(firstFrame+(FrameLength*2))]

Figure 7 - Identify frame header and calculate Frame Length

After obtaining the bitrate, sampling frequency and the state of the padding bit, the code calculates the length of the frame. The calculation is as laid out previously, but the result is multiplied by two because the file is held as a string of hexadecimal characters, so each character Python reads is a nibble rather than a whole byte. The code then reads the entire frame from the file, stores it in a secondary location and passes the location of the next frame header back to the system. This location is based on the calculated length of the frame and falls directly after the previous frame, as defined earlier.

The next frame location is then checked to ensure that the header matches the expected frame header sequence already identified, at which point the system repeats the process of identifying the bitrate, sampling frequency and padding to calculate the frame length, recover the frame and identify the following frame header. This continues until the next frame header location either matches, or falls after, the last known location of a frame header, or the end of the file is reached, as at this point all the frames of that file have been identified and recovered.

During development of the manual methodology, although a certain amount of understanding had been gleaned from the sources studied in the literature review about MP3 file structures, it was necessary to investigate an MP3 file manually to see these structures. This investigation showed that, although rare, some encodings set a padding bit without the extra byte being present in the frame, or vice versa. When this happens the calculated location of the next frame header may be off by a byte, so the header cannot be identified correctly. To combat this, an error check was implemented to ensure the frame header matches the correct version and, if not, to check one byte backwards and one byte forwards for the correct starting location of the next frame.
    elif FileToRead[(nxt_frame-2):(nxt_frame+2)] == Version:
        nxt_frame = FrameFind.SecondaryFrames(FileToRead, (nxt_frame-2), RecoveryFile)
        frame_count += 1
    elif FileToRead[(nxt_frame+2):(nxt_frame+6)] == Version:
        nxt_frame = FrameFind.SecondaryFrames(FileToRead, (nxt_frame+2), RecoveryFile)
        frame_count += 1

Figure 8 - Error check to ensure correct Frame Location
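The frame-by-frame walk and the one-byte error check can be sketched together in Python 3 as follows. This is a simplified illustration, not the tool's implementation: it works on raw bytes (so the offsets of two hexadecimal characters in Figure 8 become one byte), the names frame_length and walk_frames are hypothetical, and the tables cover MPEG-1 Layer III only. The 144 coefficient matches the formula in the text and holds for MPEG-1 Layer III; MPEG-2 and 2.5 Layer III frames use 72 instead.

```python
# MPEG-1 Layer III lookup tables; 0 marks values this sketch cannot use.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0]
SAMPLE_RATES = [44100, 48000, 32000, 0]


def frame_length(header):
    """Length in bytes of an MPEG-1 Layer III frame, or 0 if the header is unusable."""
    bitrate = BITRATES_KBPS[header[2] >> 4] * 1000
    sample_rate = SAMPLE_RATES[(header[2] >> 2) & 0x3]
    padding = (header[2] >> 1) & 0x1
    if not bitrate or not sample_rate:
        return 0
    return 144 * bitrate // sample_rate + padding


def walk_frames(data, marker, start):
    """Yield (offset, length) for consecutive frames, tolerating a one-byte drift."""
    pos = start
    while pos + 4 <= len(data):
        # Try the expected offset first, then one byte back, then one byte forward.
        for candidate in (pos, pos - 1, pos + 1):
            if candidate >= 0 and data[candidate:candidate + 2] == marker:
                break
        else:
            return  # no header near the expected location: stop walking
        length = frame_length(data[candidate:candidate + 4])
        if length == 0:
            return
        yield candidate, length
        pos = candidate + length
```

Checking the expected offset before its neighbours means a correctly sized frame is never re-aligned; only when the expected position fails does the walk look one byte either side, mirroring the order of the checks in Figure 8.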

4.1.3 Incorporating Context Triggered Piecewise Hashing

Having recovered the frames of an MP3 file, it was still necessary to identify which original audio file those frames came from, and it was decided that the best way to do this was Context Triggered Piecewise Hashing (fuzzy hashing). Initially it was thought that the standalone tool, ssdeep, would be used, but further investigation showed that an ssdeep module was available for Python.

The first task was to use this Python ssdeep module to create a database of fuzzy hashes of known MP3 files. To do this, a library of MP3 audio files was needed. The library was assembled from a selection of sources: some public domain tracks, some tracks purchased and digitally downloaded from Amazon.co.uk, and audio tracks ripped directly from audio compact discs. Ripping selected tracks from CDs allowed multiple versions of the same track to be created at different bit rates, so that the investigation could test whether the correct bitrate version of a file was identified on recovery.

The Python ssdeep module can hash a stored variable or input, and also has a function that reads and hashes a file directly. Once the library had been collected, this hash_from_file function was used to fuzzy hash each file, with each hash stored in a text document that serves as the database during the investigation.

With a comparison database of known MP3 files in place, code could be developed that hashes the file formed from the recovered frames and compares it against this database, to inform the investigator which file was the original MP3.
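A minimal Python 3 sketch of how such a text database might be written and read back is given below. The file layout (one header line, then one hash,"filename" pair per line) is an assumption inferred from the comparison code in Figure 9 and is meant to resemble the output of the ssdeep command-line tool; the function names are hypothetical. The hashing function is passed in as a parameter so the sketch runs without third-party modules; in practice it would be ssdeep.hash_from_file.

```python
def write_hash_library(paths, library_path, hash_func):
    """Write one 'hash,"filename"' line per file, preceded by a header line."""
    with open(library_path, "w") as out:
        # Assumed header, modelled on ssdeep command-line output.
        out.write("ssdeep,1.1--blocksize:hash:hash,filename\n")
        for path in paths:
            out.write('%s,"%s"\n' % (hash_func(path), path))


def read_hash_library(library_path):
    """Return a list of (filename, fuzzy_hash) tuples from the library file."""
    entries = []
    with open(library_path) as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first comma only, so commas in filenames survive.
            fuzzy_hash, _, name = line.partition(",")
            entries.append((name.strip('"'), fuzzy_hash))
    return entries
```

Splitting on the first comma only is a deliberate choice: fuzzy hashes contain no commas, whereas track filenames sometimes do.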

    if type(numberframes) == int:
        print '\n[t] Recovery process on file %s took: %s seconds' % ('\'%s\'' % branch, round(time.clock() - filetimestart, 2))
        print '[R] No. of frames recovered: ' + str(numberframes)
        # Load the hash library, skipping its header line
        with open(HashLibFile) as MP3FuzzHash:
            next(MP3FuzzHash)
            for line in MP3FuzzHash:
                FuzzHashLib.append(line)
        # Hash the recovered file and compare it against every library entry
        hashrec = ssdeep.hash_from_file(recoveryfileloc)
        for each in FuzzHashLib:
            result.append((each.split(',')[1][1:-2], ssdeep.compare(hashrec, each)))
        # Keep the entry with the highest similarity score
        for each in result:
            if each[1] > match[1]:
                match = each

Figure 9 - Fuzzy Hash and Comparison code

This fuzzy hashing process works by first loading the file that the recovered frames were written to, hashing it, and then using the ssdeep compare function to compare the fuzzy hash of this recovered file against those in the database. Each hash in the database may match partially on a similarity scale, so the code updates the match value every time a greater match is found, ultimately returning the closest match as the original MP3 that had been obfuscated.

Creating a user interface

Once these methods had been developed, a way of running the process still had to be designed and implemented, so that a user could investigate and recover files. To do this, simple code was written using the system (sys) module built into Python. A user calls the Python script from the command prompt, giving the type of interaction they want: searching a single file or an entire mounted disk image.

    if len(sys.argv) != 2:
        print '\nusage: FindObfuscation *mode*'
        print '\tprovide the operating mode:'
        print '\t\tto check all files on a disk or image, use: -i or -image.'
        print '\t\tto check an individual file, use: -f or -file'
    else:
        if sys.argv[1] == '-image' or sys.argv[1] == '-i':
            disk_examiner()
        elif sys.argv[1] == '-file' or sys.argv[1] == '-f':
            file_examiner()

Figure 10 - Initial user interaction

This method ensures the user calls the program in the correct way, allowing the system to load the correct function for the recovery process. The individual file check uses the method described above, whereas the image check looks at each file on the image individually, applying the recovery process to each in turn. Loading an image in this way allows future investigators to take an image acquired during an investigation, scan each file on it for possible frames, and recover any identified frames to a new location.

    ObfFile = raw_input('\n[*]enter the path of the potentially obfuscated file you want to identify frames from for recovery: \n[*]:')
    if os.path.exists(ObfFile):
        RecoveryFile = raw_input('\n[*]enter filename for where you would like any recovered frames to be stored.\n[*]:')
        HashLibFile = raw_input('[*]input the file to use as a hash comparison library for identification of MP3 frames.\n[*]:')
        if os.path.exists(HashLibFile):

Figure 11 - File Mode: User defined file and recovery storage

Once the user has chosen the mode of investigation, the system takes input from the user in the form of the path to the file or the mounted image. The process then uses Python's built-in os module to check that this path exists, ensuring that the user has made no spelling mistakes in the path and that the file has not been moved or the image unmounted. Following this, the system asks the user where they would like to store any recovered frames.
For an individual file, this is a filename to which the frames are written. In image mode, the system asks only for a directory in which to store recovered frames; it then uses a default filename, writing the frames recovered from each investigated file to this default filename with an increasing number appended.

    img_mnt = raw_input('\n[*]enter mount point of the disk or image containing files you want to identify frames from for recovery.\n[*]:')
    if os.path.exists(img_mnt):
        recoverydir = raw_input('[*]enter directory you would like any recovered files to be stored.\n[*]:')
        if os.path.exists(recoverydir):
            HashLibFile = raw_input('[*]input the file to use as a hash comparison library for identification of MP3 frames.\n[*]:')
            if os.path.exists(HashLibFile):

Figure 12 - Image Mode: User defined image and recovery storage

If these paths exist, the system asks the user for the location of a previously created database of hashes, again checking that the path is correct, because the context triggered piecewise hash (fuzzy hash) of the recovered file, or files in image mode, is compared against this database to identify the original file or files before obfuscation.

Once the file or image path, recovery location and fuzzy hash database location have been defined by the user, the system runs through the recovery process. When it finishes, it tells the user where recovered files (if any) can be found, based on the user's prior input. At this location is a text file listing each file that was checked for MP3 frames and the number of frames recovered, including a note that a file was not an MP3 if it contained no frames. Each file identified as an MP3 is also compared by the fuzzy hashing process, and for each recovered file the text file lists the file identified as its most likely original MP3 counterpart before obfuscation, along with the similarity match of the frames to that probable original.
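The image-mode behaviour described above, visiting every file under the mount point and giving each a numbered output name, can be sketched in Python 3 as below. This is an assumption-laden illustration rather than the tool's code: the function name plan_recovery_outputs and the default base name "recovered" are hypothetical, and the sketch only plans the output paths without performing any recovery.

```python
import os


def plan_recovery_outputs(mount_point, recovery_dir, base_name="recovered"):
    """Pair every file under mount_point with a numbered output path."""
    plan = []
    counter = 1
    for root, dirs, files in os.walk(mount_point):
        dirs.sort()  # visit sub-directories in a repeatable order
        for name in sorted(files):
            source = os.path.join(root, name)
            target = os.path.join(recovery_dir, "%s_%d.mp3" % (base_name, counter))
            plan.append((source, target))
            counter += 1
    return plan
```

Sorting the directory and file names makes the numbering repeatable between runs, which helps when the resulting report has to be compared against an earlier examination of the same image.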

5 Defining a Testing procedure

Having developed the recovery process, a testing procedure was needed to confirm that the method works as intended, effectively and within an efficient time frame. To test the manual methodology, a single file named MP3Test1.jpg will be used. This is an MP3 file obfuscated to appear as a JPEG file by prepending the JPEG header, 0xFFD8, to the beginning of the file and appending the JPEG footer, 0xFFD9, to the end. The metrics from section 5.2 will be used for both the manual methodology and the automated process.

As the automated process was written to work with both images and individual files, there will be a brief test of the tool on individual files, with the main test targeting the recovery of files on disk images. Whether operating on disk images or files, the tool uses the same process to identify and recover the files. For this purpose, a collection of 5 forensically sound USB disk images was created with various files on them, to provide different testing scenarios for the process to contend with.

5.1 Developing Test Images

The first disk image was filled with unedited MP3 files, with no obfuscation. This image gives a baseline for the system, showing that it can identify multiple MP3 files when they are already known to be MP3 files, and that the frames once recovered match the original file from the hash database.

Image two contained only standard JPEG and PNG files. These files were used to show that files not containing MP3 frames would be identified as non-MP3 files, giving an alternative baseline that shows what happens when no MP3 frames are found.

For image three it was decided that the image be filled with MP3 audio files obfuscated to look like JPEG image files.
To obfuscate these MP3s, the JPEG header 0xFFD8 was prepended to the very beginning of every file and the footer 0xFFD9 was appended to the end; no other changes were made to these files before they were added to the USB and imaged. Obfuscating these files as JPEG files hides them behind a file type whose typical file sizes match reasonably well, so the obfuscated files could go unnoticed if stored alongside many other JPEG files.

For the fourth image, it was decided to obfuscate the MP3 files as system files, specifically DLL files. Because system files are generally critical, the locations in which they are stored are less likely to be investigated closely, with people generally avoiding these areas for fear of affecting critical systems. As these locations are usually away from places where most users store data, the chances of finding properly obfuscated files there during an investigation are relatively low. To obfuscate the MP3 files as DLL files, the file header signature 0x4D5A was prepended to each file; no footer was added, as this file type has none.

Finally, the fifth image was designed to bring all of these together: it contained normal JPEG and PNG files, along with MP3 files obfuscated as both JPEG and DLL files.

| USB Image | Contents | Obfuscation (including method) |
|---|---|---|
| Image 1 | MP3 Files | None |
| Image 2 | JPEG Files, PNG Files | None |
| Image 3 | JPEG Files | MP3 -> JPEG Obfuscation (Header = 0xFFD8, Footer = 0xFFD9) |
| Image 4 | DLL Files | MP3 -> DLL Obfuscation (Header = 0x4D5A) |
| Image 5 | JPEG Files | MP3 -> JPEG Obfuscation (Header = 0xFFD8, Footer = 0xFFD9) |
| | DLL Files | MP3 -> DLL Obfuscation (Header = 0x4D5A) |
| | JPEG Files, PNG Files | None |

Table 10 - Defining the Test Image

This selection of images was based on how other software systems are tested, and informed by other investigations such as Maguire's File Carving Mp3 Fragments (2015). Giving the images a variety of content allowed the testing process to show how the system handles different situations.
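The obfuscation used to build the test files is simple enough to express in a few lines. The Python 3 sketch below shows the idea, using the signatures listed in Table 10; the function and constant names are hypothetical, and it operates on bytes in memory rather than on files.

```python
JPEG_HEADER = bytes.fromhex("ffd8")
JPEG_FOOTER = bytes.fromhex("ffd9")
DLL_HEADER = bytes.fromhex("4d5a")


def obfuscate(mp3_bytes, header, footer=b""):
    """Wrap MP3 data in another file type's signature bytes."""
    return header + mp3_bytes + footer


def as_jpeg(mp3_bytes):
    """MP3 -> JPEG obfuscation: header at the start, footer at the end."""
    return obfuscate(mp3_bytes, JPEG_HEADER, JPEG_FOOTER)


def as_dll(mp3_bytes):
    """MP3 -> DLL obfuscation: header only, as the type has no footer."""
    return obfuscate(mp3_bytes, DLL_HEADER)
```

Because the MP3 data itself is left untouched, the obfuscated file grows by only two to four bytes and every frame remains contiguous, which is the property the recovery process relies on.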

By creating images one and two to contain only known files, in their original file types, it is possible to show how the system deals with files containing definitive MP3 frames and with those containing no MP3 frames. Images three and four allow the recovery process to show how MP3 frames can be identified and recovered from files which appear to all other sources to be non-MP3 files. Finally, the last image gives the system the chance to process different files and file types, some containing MP3 frames and others not. This variety shows how the system handles identifying and recovering the frames of MP3 files whilst also identifying non-MP3 files. The contents of each image were chosen to simulate the standard normal and extreme cases used in software testing.

To ensure the testing process is forensically sound, a formatting and zeroing process was applied to the USB before starting, and between the creation of each image. The USB was zeroed using the Linux tool shred with the -fz flags. The -f flag forces the permissions to allow writing on the drive, so that nothing escapes the process whatever permissions are set, whilst the -z flag adds a final overwrite of the disk with zeroes, to ensure the drive holds no hidden data. After zeroing, the USB was repartitioned using GParted and formatted to FAT32, to allow the image to be used across multiple platforms if needed. The USB could then be loaded with the required files before being captured as a bit-identical image using the Linux tool dd.
This reduces the risk of corruption of the files or file structure of the image: zeroing the USB before each setup gave control over the data on the drive, ensuring that only the files wanted for each image were present.

5.2 Choosing Testing Metrics

To test the effectiveness of the methodology on the created images, the results needed to be measured, and metrics for doing so had to be chosen. A few metrics were suggested by Maguire (2015). For the work in that paper, which looked at recovery of fragmented MP3 files in unallocated space, the speed of the process, the size of the recovered file and the audio quality of the recovered data were important, and were therefore used as its main metrics. For testing this process these metrics are of interest, but not all of them are useful. Audio quality is defined by the bitrate and sample rate of the original encoding, and as this process either identifies frames or does not, quality is predetermined and unaffected, so it is not a relevant metric.

Fuzzy Hash Comparison

The most relevant metric for this process is how closely the frames recovered from the investigated file match an existing MP3. The higher the comparison match of a recovered file against a known file, the more likely it is that the recovered frames were correctly matched to the file they came from. Ideally, given that the only data lost should be metadata, this comparison should be above 90%. Anything less is unlikely to be a correct match, as it means that more than 10% of the file differs from the original, which should not be possible when carving contiguous blocks of a file. The process relies on this fact, as all frames of an MP3 are contiguous to allow for correct playback.

Recovered File Size

When recovering the frames of an MP3 file, the process can lose a certain amount of data at the beginning or end of the file, because it ignores metadata and non-audio data before the first frame or after the last frame.
As the ignored data would invariably be small, because metadata rarely takes up much space in a file, this gives the investigator a cursory indication that the file was recovered correctly; if the file size is vastly different, then something probably went wrong during identification and recovery.

Speed of Process

The speed of carving, whilst not affecting the results directly, is still an important metric in the context of a forensic investigation. Forensic investigations have a limited amount of time and need to process data in as little time as reasonably possible. The time taken, in minutes or seconds, to carry out carving both manually and with the automated tool should be measured in order to test the efficiency of the method.

Playability of the Recovered File

The quality of the audio will, assuming recovery was correct, depend upon the bit rate and sample frequency of the original file. A direct measure such as audio quality is therefore of limited use for testing the process. Instead, the files will be assessed by listening to them play back, to judge whether they were recovered correctly. Playback of recovered frames is possible without recovering a header for the file because, as explained in section 2.2.1, MP3 files rely on the frame headers for playback rather than on their file header, the ID3 tag, which acts more as a metadata container than a true file header.

6 Results

6.1 Testing the Manual Recovery Process

In testing the manual methodology, the first task was to identify the beginning of the first frame of the file. To this end, the file was opened in a hex editor and searched for the potential frame header. Once the first frame header was found, calculations were performed to identify the next few frames, to ensure the correct frame header had been identified. As explained in the methodology description, the number of frames quickly becomes too high to calculate each one manually. Instead, once the first 20 or so frames had been identified to prove the method worked, the file was skimmed through to the end and the final frame calculated. Once the first and final frames had been confirmed, the contents between them were copied to another file, giving the new file RecoveredFile1.mp3.

Recovering the frames to this file proved that the manual process, and therefore the theory behind the automated process, works: the frames in the obfuscated file could be identified and written to a file that opened and played back in VLC media player as an audio file. When tested with the Linux file command or the Sleuth Kit tool sorter, the original obfuscated file appears to be a JPEG file; the same tools recognise the recovered file as MPEG Layer III.

| | Original File | Obfuscated File | Recovered File |
|---|---|---|---|
| | Dr. Dre feat. Eminem - Forgot About Dre 224Kbps.mp3 | MP3Test1.jpg | RecoveredFile1.mp3 |
| Size of file | 6225KB | 6226KB | 6077KB |
| Length of file (m:ss) | 3:42 | N/A | 3:42 |
| Playback quality | | N/A | File plays back ok |

Table 11 - Manual Test recovery MP3Test1.jpg

As seen in Table 11, the recovered file is slightly smaller than the obfuscated file or the original MP3, but the length of the file is identical. This is because the only data lost is metadata, which does not affect the audio track, so playback is the same as that of the original file.
The comparison using ssdeep also shows a 97% similarity between the recovered frames and the original file, the 3% difference essentially being the metadata that precedes the frames and is lost on recovery.
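As a rough cross-check of these figures, the expected amount of audio data can be estimated from the frame count alone. The Python 3 sketch below does this for the 224 kbps track, using the 8512 frames reported for the same track in Table 15. The 44.1 kHz sampling frequency is an assumption (the tables do not state it), as is the figure of 1152 samples per MPEG-1 Layer III frame being applicable to this file; the function names are hypothetical.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III


def expected_audio_bytes(frames, bitrate, sample_rate):
    """Total frame data, using 144 * bitrate / sample_rate bytes per frame."""
    return frames * 144 * bitrate // sample_rate


def expected_duration(frames, sample_rate):
    """Playing time as (minutes, seconds), rounded down to whole seconds."""
    seconds = frames * SAMPLES_PER_FRAME // sample_rate
    return divmod(seconds, 60)


audio_bytes = expected_audio_bytes(8512, 224000, 44100)
minutes, seconds = expected_duration(8512, 44100)
print(audio_bytes)                      # 6225920
print("%d:%02d" % (minutes, seconds))   # 3:42
```

Under these assumptions the estimate is roughly 6.2 million bytes and a playing time of 3:42, which agrees with the length in Table 11 and is of the same order as the reported file sizes. The exact size comparison depends on whether KB in the table means 1000 or 1024 bytes, which is not stated.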

6.2 Recovering using the automated process

The next stage of testing used the automated process, first on four individual files and then on the five images defined in section 5.1. For the image test procedure, the tool developed in section 4 was run via a terminal in Ubuntu, with each image mounted in read-only mode. The images needed to be mounted for the system to identify each individual file on them, and mounting them read-only ensured that no changes were made to the files.

Automated Testing on individual Files

| | Original File | Obfuscated File | Recovered File |
|---|---|---|---|
| | Dr. Dre feat. Eminem - Forgot About Dre 32Kbps.mp3 | MP3Test1.jpg | RecoveredFile1.mp3 |
| Size of file | 868KB | 868KB | 866KB |
| Length of file (m:ss) | 3:42 | N/A | 3:42 |
| Playback quality | N/A | N/A | File plays back ok |

Identification and recovery: 8512 frames, 0.42 seconds
Comparison Match of recovered file to original file: 97% Match

Table 12 - Automated Test 1

| | Original File | Obfuscated File | Recovered File |
|---|---|---|---|
| | Avril Lavigne - Complicated - 96Kbps.mp3 | MP3Test2.jpg | RecoveredFile2.mp3 |
| Size of file | 2868KB | 2868KB | 2866KB |
| Length of file (m:ss) | 4:04 | N/A | 4:04 |
| Playback quality | N/A | N/A | File plays back ok |

Identification and recovery: 6793 frames, 0.37 seconds
Comparison Match of recovered file to original file: 97% Match

Table 13 - Automated Test 2

| | Original File | Obfuscated File | Recovered File |
|---|---|---|---|
| | Kingdom Hearts - Dearly Beloved.mp3 | MP3Test3.jpg | RecoveredFile3.mp3 |
| Size of file | 5422KB | 5422KB | 5366KB |
| Length of file (m:ss) | 3:01 | N/A | 3:01 |
| Playback quality | N/A | N/A | File plays back ok |

Identification and recovery: 6933 frames, 0.51 seconds
Comparison Match of recovered file to original file: 97% Match

Table 14 - Automated Test 3

| | Original File | Obfuscated File | Recovered File |
|---|---|---|---|
| | Dr. Dre feat. Eminem - Forgot About Dre 224Kbps.mp3 | MP3Test4.jpg | RecoveredFile4.mp3 |
| Size of file | 6225KB | 6225KB | 6077KB |
| Length of file (m:ss) | 3:42 | N/A | 3:42 |
| Playback quality | N/A | N/A | File plays back ok |

Identification and recovery: 8512 frames, 0.61 seconds
Comparison Match of recovered file to original file: 97% Match

Table 15 - Automated Test 4

As can be seen from Tables 12 to 15, the recovery tool was able to identify the frames in all four test files, which had been obfuscated as JPG files before recovery.

Image 1 - MP3 files in native format

Using the recovery process on the first image, the system proceeded through each file on the image in turn, first assessing whether the files were MP3, which in this instance they were, as the image was created with only MP3 files.

| Filetype on Image | Number of files on image | Number of recovered MP3 files | Average Comparison | Time Taken to recover files on Image |
|---|---|---|---|---|
| Non-obfuscated, standard MP3 | | | % | seconds |

Table 16 - Results of Image 1 recovery

As can be seen in Table 16, the recovery process, when run in full on image 1 of the testing scenarios, identified all thirty-three MP3 files as MP3 files. It was then able to recover the frames from these files into individual files per test file, and use the fuzzy hashing process to identify each of the files with a good comparison level. A 96% average comparison implies the recovered files were all almost identical to their original contents. This is to be expected, as each of these files was known to be an MP3 file, but it bodes well for the recovery of later files.

Image 2 - Standard, unedited JPEG/PNG files

The results for Image 2 contained a strange anomaly. This image had been created with real JPG and PNG files, with no obfuscation or editing done to them. No files were recovered from this image, which was expected, but the image had 60 picture files on it and the results recorded only 56 files being processed before the process finished. This anomaly was due to a PNG file appearing to contain a frame header identifier, with no follow-up frame header information or data that could be seen as a frame. After this, the program moved on to continue checking the rest of the files.

| Filetype on image | Number of files on image | Number of files identified as potential MP3 files | Number of files recovered as MP3 files | Average Comparison |
|---|---|---|---|---|
| Standard JPG | | | | N/A |

Table 17 - Results of Image 2 recovery

This image has no comparison match values, as no MP3 files were recovered to be compared against the fuzzy hash database, which is what was expected.

Images 3 and 4 - MP3 files obfuscated as JPEG and DLL files

As stated in section 5.1, image 3 was built with MP3 files obfuscated as JPG files. These files were obfuscated by adding 0xFFD8 and 0xFFD9 to the beginning and end of the file as the header and footer of a JPG file. Alongside this, Image 4 was built using the same selection of MP3 tracks but obfuscated to appear as DLL files, by adding 0x4D5A at the beginning of the file.
| Filetype on Image | Number of files on image | Number of recovered MP3 files | Average Comparison | Time Taken to recover files on Image |
|---|---|---|---|---|
| MP3 Obfuscated as JPG | | | % | seconds |

Table 18 - Results of Image 3 recovery

The selection of MP3s obfuscated for these two images was identical and, although slightly different from the selection on Image 1, was still taken from the same database that had been collected in the fuzzy hash list. As can be seen in Table 18 and Table 19, this did alter the similarity match slightly, giving an overall average match of 97%. Appendix 4 shows that, in this case, some of the tracks returned a 100% similarity match, meaning that no data was lost in the recovery of the file. This indicates that these particular tracks had no metadata before the frames in the obfuscated file, in this instance because they were public domain music and therefore carried no artist or source metadata.

| Filetype on image | Number of files on image | Number of recovered MP3 files | Average Comparison | Time Taken to recover files on Image |
|---|---|---|---|---|
| MP3 Obfuscated as DLL | | | % | seconds |

Table 19 - Results of Image 4 recovery

Image 5 - Mixed JPEG/PNG files, MP3 files obfuscated as JPEG and DLL files

The final scenario for image-based identification and recovery is Image 5. As previously stated, this image is made up of different file formats, with two distinct kinds of obfuscation applied to the MP3 files: JPG obfuscation and DLL obfuscation. Alongside these obfuscated MP3 files, the image also contains many standard JPG files. The intention of mixing different obfuscated file types with some non-MP3 files is to show how easily the process handles the different files and file types. Because it disregards file signatures, the automated process is able to identify the hidden MP3 frames within the obfuscated files, whatever the type of obfuscation. In the same manner, because it ignores the signature of the file and instead looks for MP3 frame structures, it can identify that the normal JPG files are not MP3 files.
As far as the automated process is concerned, it doesn't matter what these other file types are, just that they're not MP3 files.

For this scenario, all the MP3 files in the database were obfuscated, with both a JPG and a DLL version created of each file. The files written to the image were then selected at random from these two groups of obfuscated files. This ensured that no expectations influenced the results, and has meant that in some instances the same original MP3 was returned as the recovery match for more than one file.

| Filetype on Image | Number of files on image | Number of recovered MP3 files | Average comparison | Time Taken to recover files on Image |
|---|---|---|---|---|
| MP3 Obfuscated as DLL | | | 97% | |
| MP3 Obfuscated as JPG | | | | |
| Unedited, standard JPG | 30 | 0 | N/A | |

Table 20 - Results of Image 5 recovery

As can be seen in Table 20, the system was able to identify the MP3 frames of every file in both sets of obfuscated files, recover all of these files and perform a comparison match on each of them. The JPG files were all identified as not matching the MP3 file structure, and were therefore ruled out of the recovery process by the program. The average comparison match once again came to 97%, likely because two of the MP3 files came from the selection of public domain files mentioned earlier and gave individual similarity matches of 100%.

Because there were so many files per image, a full readout of each file recovered from each image has been included in Appendix 4 Test Results to keep this section clear; for a more in-depth view of each file's comparison match, please look there.

Test Summary

Overall, as can be seen in Table 21, the recovery process successfully identified and recovered the MP3 frames of all files that contained MP3 file data. Across all the images this included files that had been obfuscated and files that had not. It was also able to successfully identify non-MP3 files and ignore them.

After successfully identifying MP3 frames where they existed, and recovering these into playable files, the fuzzy hashing process successfully identified all the files that had been used, providing the filename of the original file before obfuscation. These matches were very high overall, with an average of 96.75% across the images containing MP3 files. Some files matched at 100% because they had no metadata before obfuscation; the other files had varied matches, with 93% the lowest, due to larger amounts of metadata being present in the original file and therefore more data lost upon recovery.

| | Image 1 | Image 2 | Image 3 | Image 4 | Image 5 |
|---|---|---|---|---|---|
| MP3 Files | 33 | | | | |
| JPG Files | | | | | |
| MP3 Files Obfuscated as JPG files | | | | | |
| MP3 Files Obfuscated as DLL files | | | | | |
| MP3 Files Recovered | | | | | |
| Average Similarity Match | 96% | N/A | 97% | 97% | 97% |

Table 21 - Summary of automated test process

7 Evaluation

The primary aim of this project was to develop a process for identifying and recovering MP3 data from obfuscated files. To do this, the basic principles of file carving needed to be understood, along with the underlying file structure of the MP3 audio format. In researching these two aspects it became possible not only to develop a viable method for this identification and recovery but also to design and create an automated recovery tool. This automated tool implemented a deep understanding of the MP3 audio file structure, alongside a file carving technique based on both the SmartCarving technique outlined by Poisel, Tjoa and Tavolato (2011) and the Fast Object validation technique suggested by Garfinkel (2007).

By understanding the frame structure that makes up the MP3 file format, and identifying the frame header within these frames, this investigation theorised a method that enabled the identification of obfuscated MP3 files, even if no other available tool could see them as MP3s. It used the data stored in these frame headers to assess the length of each identified frame and write that frame out to a file, allowing the recovery of the original MP3 file. After identification and recovery had been processed, it was able to use the Context Triggered Piecewise Hashing tool, SSDeep, to compare the recovered file against a database of known MP3 file hashes and give an indication of the original MP3 file.

The manual method and the automated recovery process were then tested, using the testing metrics defined in 5.2. Using the fuzzy hash comparison given by SSDeep as the primary metric allowed the investigation to identify which file had been obfuscated. It has a high chance of identifying the MP3, with larger similarity scores indicating a better match, although this can only prove that a file matches an MP3, and doesn't prove which is the copy.
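The frame-length calculation and frame-by-frame carving described above can be sketched as follows. This is a minimal illustration rather than the code of the tool itself: it assumes Layer III frames only, takes its bitrate and sample-rate tables from the publicly documented MPEG audio frame header layout, and the names `frame_length` and `carve_frames` are hypothetical.

```python
# Sketch only: Layer III frames, tables from the public MPEG audio header layout.
BITRATES_V1 = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
BITRATES_V2 = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160]
SAMPLE_RATES = {
    0b11: [44100, 48000, 32000],  # MPEG version 1
    0b10: [22050, 24000, 16000],  # MPEG version 2
    0b00: [11025, 12000, 8000],   # MPEG version 2.5
}


def frame_length(header):
    """Return the frame length in bytes, or None if the 4 bytes are not a usable header."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        return None  # sync bits missing
    version = (header[1] >> 3) & 0b11
    layer = (header[1] >> 1) & 0b11
    if version == 0b01 or layer != 0b01:
        return None  # reserved version, or not Layer III
    bitrate_index = header[2] >> 4
    rate_index = (header[2] >> 2) & 0b11
    if bitrate_index in (0, 15) or rate_index == 3:
        return None  # free-format or invalid values are skipped in this sketch
    padding = (header[2] >> 1) & 1
    if version == 0b11:
        coefficient, kbps = 144, BITRATES_V1[bitrate_index]
    else:
        coefficient, kbps = 72, BITRATES_V2[bitrate_index]
    return coefficient * kbps * 1000 // SAMPLE_RATES[version][rate_index] + padding


def carve_frames(data):
    """Return the first run of contiguous frames found in data, ignoring its file signature."""
    frames = bytearray()
    pos = 0
    while pos + 4 <= len(data):
        length = frame_length(data[pos:pos + 4])
        if length is None or pos + length > len(data):
            if frames:
                break  # the contiguous run has ended
            pos += 1
            continue
        end = pos + length
        next_ok = end == len(data) or frame_length(data[end:end + 4]) is not None
        if not frames and not next_ok:
            pos += 1  # isolated match with no following frame: treat as a false positive
            continue
        frames += data[pos:end]
        pos = end
    return bytes(frames)
```

Because each frame's length comes from its own header, the loop jumps straight to where the next header should be instead of searching byte by byte inside audio data, and any leading signature or trailing tag is simply never copied.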
Secondary to the fuzzy hashing, the other three metrics were the recovered file size, the playability of the recovered file and the speed of the process. The recovered file size shows that, overall, very little data will be lost per file recovery. Playability allows users to confirm that the file has been recovered properly and that the similarity match picked the right file to match the recovered file against. Speed is important because forensic investigators need to be able to assess large numbers of files in as short a time as possible.

7.1 The Process

The process used to identify and recover files must now be evaluated by looking closer at the test results, to assess how well it is able to perform the tasks it was designed to carry out. To do this, the testing metrics defined in 5.2 will be used to investigate the successes or failings found in the system.

7.1.1 Manual Recovery

As can be seen in the results from 6.1, the identification and recovery technique proposed was a reliable and accurate, but time-consuming, process. The recovered file was identified using SSDeep as having matched the MP3 file Dr. Dre feat. Eminem - Forgot About Dre 224Kbps.mp3, which when played back was confirmed as the correct audio track. The recovered file was a very close match for the original file size, although smaller by about 150 KB. This is due to the amount of metadata in the file before the first frame.

By calculating the size of each frame upon identification, the process does not need to allow for false positives in the recovered data. It knows that each frame is correct and moves on to the next after carving with no issues. Of course, if extra data has been inserted into a frame, this would cause an issue in the recovery method, as the frame length would not be correct. Given that this would require the suspect to remove the data themselves before they could use the file, it was deemed unlikely that a normal user would insert data into the frames, as this would cause corruption.

The time taken to do a manual recovery of a file is an issue. In order to be sure that the frames had been correctly identified, an investigator would need to process a large number of contiguous frames and match their sync headers, bit rates and sample frequencies manually before moving on through the file. At 26ms per audio frame, an MP3 of three minutes would consist of approximately 6923 frames. To manually identify this many frames could take a user an exorbitant amount of time, and therefore the automated process was developed, as the computer would be able to process the same data in a much smaller timeframe.
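The frame count quoted above is simple arithmetic, and a short sketch makes it easy to repeat for other track lengths. The 26 ms figure is the one used in the text; the 1152 samples per frame is an assumption that holds for MPEG version 1 Layer III audio, and the function names are hypothetical.

```python
SAMPLES_PER_FRAME = 1152  # assumption: MPEG version 1, Layer III


def frame_duration_ms(sample_rate_hz):
    """Playback time covered by one frame, in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz


def estimated_frames(track_seconds, frame_ms=26.0):
    """Approximate number of frames in a track of the given length."""
    return round(track_seconds * 1000 / frame_ms)


three_minutes = estimated_frames(180)                          # 6923, as in the text
more_precise = estimated_frames(180, frame_duration_ms(44100))  # 6891 at 44.1 kHz
```

Using the unrounded frame duration for 44.1 kHz audio gives a slightly lower count than the rounded 26 ms figure, but either way a manual investigator would face several thousand headers for a single three-minute track.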

7.1.2 Automated Recovery

Section 6.2 outlines the results of the automated process used against the test scenarios.

Individual Files

For the individual files, all the frames were identified correctly by the tool, giving four recovered, playable files. When played back, all four files could be identified as four different tracks. The recovery process itself performed in a reasonable time frame, taking less than a second on all four files. The recovered file size was generally a close match, and the playback length of each file was identical to the original.

Test Images

The first test image only had unedited MP3 files on it, so it is unsurprising that all the files were identified as being MP3 files, with frames being recovered from all of them. These recovered files all gave a high similarity match against files in the database, and every file was matched against the correct file. The fluctuations in the percentage of similarity found can be attributed to the amount of metadata the original file had, which was lost from the recovered files due to the method of carving only the frames. Each recovered file was playable and sounded the same as the original file, and each recovered file had a comparable size to the original file. Any differences could again be attributed to the amount of metadata lost in the process. At seconds for identification and recovery of 33 files, this works out at less than a second to identify the frames of an MP3 and recover the playable audio file.

Conversely, Image 2 had no MP3 files on it, and therefore no files being recovered was exactly the expected result. Interestingly though, during the recovery process on image 2, two of the PNG files were identified as having potential MP3 frame headers in them. The tool was unable to decode any frame data for these headers, so it is likely a false positive, caused by one of the potential version identifiers from Figure 4 being identified as present in the file.
Given that these version identifiers are a 2-byte string, this isn't a surprise. It does show, however, that the tool could see that, although the identifier was present, there was no actual frame present, and it didn't attempt to recover any frames from that file. It may be necessary for an investigator to investigate further on files such as these which show potential frames.

When processing images 3 and 4, with their files obfuscated as JPG and DLL respectively, the tool could identify all 33 of the files as being MP3 files. By comparison, the Sleuth Kit tool sorter identified all 33 files as being JPG files for image 3 and DLL/system files for image 4. This shows again that the process is able to identify obfuscated MP3 files, no matter the obfuscation type. The files recovered from these two images were all playable, and each one was matched against the database to the correct audio track, confirmable by listening to them. The recovered audio sounded identical to the original file, and the size of each file was close to the original, with only negligible changes in size, as in image 1's results, due to the metadata that had been discarded. Processing of image 3 took seconds whilst image 4 took seconds. These two images had the exact same file selection on them, just obfuscated differently, and should theoretically have taken the same amount of time, so it can only be assumed that the processor of the computer was busy with a second task during the recovery of image 4.

Finally, image 5, consisting of obfuscated MP3 files and standard JPG files, took a total of seconds to process the entire image and recover files. This lower time is likely due to the fact that it only needed to identify twenty-one of the files on the image as containing MP3 frames. It therefore had a lot less work to do recovering the MP3 files, as there were fewer of them overall, and it only looked at the other files enough to identify that they were not MP3 files. The results reflect the other image results as well: the twenty-one files on the image that had been obfuscated were identified as MP3 files, with twenty-one recovered files being extracted.
Each of these files was identified with high similarity matches against tracks in the database. The matches could be confirmed upon playback, as the files were all playable and could be identified. The size differences between these recovered tracks and their original counterparts once again appear to be due to lost metadata, as expected, with no files having large discrepancies.

As explained in section , the files can be played after recovery because they do not need the MP3 file header for playback, instead relying on the frame structure system built into the file when no header is present. The MP3 format uses the ID3 tag as a header, and this tag acts more as a metadata container than a real identifier of the file.

When developing this process, a lot of past work was looked at regarding recovery rates of fragmented files. In all these other research papers, there appeared to be a high rate of false positives, and often large sections of the file were unable to be recovered. This sort of result is to be expected in file fragmentation investigations, but in this paper these issues were not encountered at such high frequency, possibly due to the use of obfuscated, whole files rather than fragments. By searching for the frame header as a common pattern in the file, and calculating the length of each frame based on the data in the header, the investigation managed to avoid false positive results. The fuzzy hashing mechanism has shown that a high level of recovery was achieved: the matches are high because the files were almost identical upon recovery.

7.2 Impact on forensic investigations

Forensic investigations will need to look at thousands of files at a time. Applying this process, with an average of files per second during identification and recovery of each image, it can be theorised that an image containing files could potentially be scanned for MP3 files, with any hidden tracks identified and recovered, in less than an hour and a half. It may take less time depending on how many files are actually MP3 files, trimming the time taken for forensic investigations significantly.

What this essentially means is that an image containing thousands of files can be scanned and, theoretically, all MP3 files on the image discovered, whether obfuscated or not, recovering them and giving investigators a method of identifying files that had been copied from another source. Although this can prove that the obfuscated file matches another MP3, it cannot prove the origin of the file.
It is able to identify the match, but would be unable to say, for example, that the obfuscated file had been copied from another. It is quite possible, however unlikely, that the obfuscated file is the original file, and might even be a legal copy of an MP3, or it could be that both are copies of an original file.

7.3 Issues discovered

The process of hashing the files against a known database of files means that the investigator would need a database containing several hashes of each track at different bitrates, to ensure that the one they found was identified. Whilst the audio track itself would be recovered, this requirement might mean that it is still impossible to prove copyright infringement. When looking at the MP3 structure, there is a single bit in the header that details whether the file is copyrighted. This bit being set may help investigators prove copyright infringement, but it relies upon the encoding process setting this bit, which isn't always guaranteed. At best, this means that the investigator can prove that a file exists and that the user hid this file, but not necessarily that they were breaking copyright laws by having it.

Ripping CD audio tracks to MP3 is relatively easy, and there are many different programs that allow users to do this. The act of ripping CDs is in itself not illegal: provided the user owns the original CD, they are well within their rights to make a digital backup for their own use. This tool doesn't have the ability to identify the source of the MP3, but if fuzzy hashing is used it can potentially tell if a file is a copy of another file, possibly found on illegal sources. Furthermore, if a file is an audio recording that is not a copyrighted music track, for example a recording of a conversation between two people, then it would depend on the investigator listening to this file after the system recovers it to prove any wrongdoing.

During the testing performed in this investigation, as shown in section 6.2.3, it was discovered that some PNG files appear to contain data that looks like the start of an MP3 frame. This caused issues for the recovery process, as it started to hang on some files while trying to decode the frames. This was not an issue on such a small test bed but could cause extended delays on larger file systems with more files on them.
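The copyright bit mentioned above sits in the fourth byte of the frame header, next to the "original" bit. The sketch below shows how the two flags could be read from a candidate header. It follows the publicly documented MPEG audio header layout; the function name is hypothetical and this is not taken from the tool itself.

```python
def copyright_flags(header):
    """Read the copyright and original bits from a 4-byte MPEG audio frame header."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        raise ValueError("not an MPEG audio frame header")
    return {
        "copyright": bool(header[3] & 0x08),  # bit 3 of the fourth byte
        "original": bool(header[3] & 0x04),   # bit 2 of the fourth byte
    }
```

As noted in the text, the value is only as reliable as the encoder that wrote it, so a cleared bit says nothing about whether the audio is in fact copyrighted.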
7.4 Personal Development

To develop this project, I had to be able to understand the principles of the MP3 structure, file carving techniques and the fuzzy hashing process. In order to do this, a lot of time was spent researching these aspects and evaluating the information available from other pieces of work that had been done in these areas. This critical analysis allowed me not only to develop a manual methodology for identifying MP3 frames within a file but also to develop my understanding of how frame sizes were calculated, allowing me to apply that knowledge to identify each frame within an individual file.

After theorising a method of discovery for MP3 frames, I then had to develop my programming skills to a point where I could write an automated process to do the discovery and recovery in a quick manner. This task was a difficult aspect. I had relatively good basic capabilities with writing code in Python, but I had never written a piece of code as intricate as the code I developed here. I also had to learn the syntax and usage of libraries that I had never used before.

Despite identifying three versions of MP3 during my literature review, whilst coding the tool I only incorporated two, versions 1 and 2, into my identifier before breaking the identifiers down by protection. This was only noticed during the testing phase, when a single MP3 from my collection could not be identified. I discovered the issue by looking for the frames of the MP3 manually and noticing the extra set of identifiers for the third MP3 version, 2.5. I added this into my identifier script, reanalysed the file with my tool and was then able to identify the MP3.

I was forced to make late changes to the project, focussing on obfuscated files instead of the initial plan to identify MP3 fragments on disk images, due to being unable to identify a method of focussing my code on identifying MP3 frames in unallocated space.
The problem arose because the identifiers for frame headers, shown in Table 8, were appearing in the allocation table and unallocated space well before the location of any of the MP3s on the image. I also had trouble identifying a method of loading these images with Python tools that allowed the system to investigate the binary data of the entire image. Given more time I may have been able to work around these issues, but as it was, this meant that I had to do a lot of reworking on my process to finish the project on time.

My initial method of reading the files took a large amount of time for each individual file. At approximately twenty minutes for a single file, the process would have taken too long for an investigator to realistically use. This extended timeframe was frustrating, and upon investigation I concluded that it was due to the way I was reading the file. I was initially reading the file from the source at the identified offset of where each frame should be located. This meant that for every frame the tool was reopening the file, seeking to an offset, reading the data and then closing the file. By instead loading the file into memory at the start of each process, I could look through the already stored data each time, instead of reloading the file afresh. Refining this enabled my process to change from twenty minutes per file to less than a second for the same file, and was essentially a case of optimising my code better.

Developing the method and automated tool allowed me to understand the difficulties that forensic investigators have when trying to identify obfuscated files. It also increased my understanding of how fuzzy hashing might be applied to files in order to show their true contents, and how hashing can be used to identify data on a system.
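The two read strategies described above can be contrasted in a short sketch. Both functions return the same bytes; the difference is only in how often the file is opened. The function names, and the idea of passing in a list of offsets, are hypothetical simplifications rather than the structure of the actual tool.

```python
from pathlib import Path


def read_frames_reopening(path, offsets, size):
    """Slow pattern: open, seek, read and close the file once per frame."""
    chunks = []
    for offset in offsets:
        with open(path, "rb") as handle:
            handle.seek(offset)
            chunks.append(handle.read(size))
    return chunks


def read_frames_in_memory(path, offsets, size):
    """Faster pattern: read the whole file once, then slice the bytes in memory."""
    data = Path(path).read_bytes()
    return [data[offset:offset + size] for offset in offsets]
```

Loading the whole file trades memory for speed, which is reasonable for typical MP3 sizes. For very large inputs such as full disk images, a memory-mapped file (Python's standard `mmap` module) might be a better fit, although that is outside what the text describes.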

8 Conclusions

In conclusion, by understanding the MP3 file format it was possible to theorise a method for reading and understanding the frame structure used by MP3 files. By then applying this knowledge to any file, ignoring the identifying file signature and looking instead at this internal structure, identification of MP3 frames, which only exist in the MP3 format, was made possible.

The manual methodology was a highly accurate method of doing this identification, and the recovery stage, but was very time-consuming. With files containing thousands of frames, due to the length of an individual frame, it was obvious that an automated method was needed to allow users to investigate multiple files in an abbreviated time frame. The automated tool that was developed not only allows recovery of data previously undiscoverable by traditional investigation methods but also, due to its speed, means that an investigator would take less time to find the evidence for their investigation.

The system developed was able to completely disregard the obfuscation employed on the file and look at the file contents itself to recognise that the file was an MP3 file. This allowed the tool to recover the audio frames from the obfuscated files in a swift manner, by carving based upon file structure signatures rather than the leading file signature, which is how other tools used by investigators operate. This means that a file which may appear to other investigation tools to be a JPEG, video file, system file or other file type can be identified and recovered by investigators looking for MP3 files. Existing file identification tools and file carving processes require the file signature to be visible in order to identify a file type, but with this proposed process, investigators can still recover these previously unfound files.

This process has merged the key aspects mentioned in section of Fast Object validation and the feature-based classification used by the SmartCarving technique with the process of investigating binary structures that Oscar uses, to recover the data identified with the proposed identification technique and define a method for identifying MP3 files which other tools were not able to recognise.

What it also means is that if a file is corrupted, or only part of the file is available, the tool allows investigators to at least recover the portion of the file that is available. Even if that file is not a pirated audio track, and is instead, for example, a partial recovery of a recording between people, that portion of the recovered file will still be playable and can be further investigated as needed by the investigator.

Verification of complete or even partially recovered MP3 files has been made possible with the use of fuzzy hashing in the system. If a whole file is recovered, the file will match almost identically to the original file. A partial file, on the other hand, may still match a known file but at a lower similarity. The method of fuzzy hashing will still recognise the file contents as belonging to the known file, as the sections recovered would still give similar hash results.

The process has allowed for all different versions of MP3 files to be recovered and allows the system to rule out non-MP3 files from its recovery process. It would therefore allow investigators to apply it to a suspect image to find just the audio data needed for an investigation.

The aim of this project had been to develop a tool that applied an automated method of identifying and recovering MP3 frames within obfuscated files on a disk or disk image, using file carving techniques, and to then use fuzzy hashing to compare these recovered audio frames to a list of hashes of known MP3 files to identify the original file. To complete this aim, the objectives were to develop an understanding of the MP3 file format, advanced file carving techniques and the principles of context triggered piecewise hashing, and then, using this understanding, to develop a manual method of identifying and recovering MP3 frames hidden by obfuscation before progressing this into an automated method by creating Python scripts.
By researching the file structure used by MP3 files, it was first possible to develop a manual method of identifying MP3 frames, and then to develop this method into a Python script to identify MP3 frames in a more efficient manner. Understanding advanced file carving techniques enabled these processes to be implemented in the manual methodology for recovery, and implementing them in the automated tool allowed recovery without secondary application of extra techniques. By researching the background of context triggered piecewise hashing and the tools available to perform it, a process of identifying the obfuscated file was developed. This process was able to confirm what MP3 an obfuscated file had originally been but, unfortunately, was not able to prove that the file had been copied from a specific source.

8.1 Future work

The tool as it stands now is quite limited. By only investigating the MP3 file format, it has limited evidence recovery to that small section. Real investigations may require investigators to discover many types of files, including video and image files, which the tool currently does not allow for. Obfuscation is used upon many of these file types to hide them, and investigating these file types to incorporate a method of identifying and recovering them could make the tool even more useful during investigations. Many video formats, such as the MP4 or MOV file types, use a similar framing structure to MP3 files and, although work would be needed to identify these parts, may be able to be adapted into the tool.

Another downside to the current tool is that it only works on recognised files. Often an investigation will be attempting to discover files in unallocated space or even to find fragmented files for recovery. The tool could benefit greatly from being developed for application to fragmented files. In its current state, it would struggle to identify these fragmented files, as it relies heavily on finding MP3 frames contiguously within the file structure and would not be able to recognise corrupt, overwritten or missing sections which should contain more frames. The basic mechanisms of the current process could even be used to form the basis of this fragmented file recovery, by searching unallocated space for frame header signatures, but there would need to be a deeper investigation into identifying and excluding false positives from the results. The current process uses the location of the next frame header to ensure that what appears to be a frame header is in fact correct.
With fragmented files, this might not be possible, as the next header may be in the next fragment and therefore not located where expected. This would limit the use of the current verification method, which could allow false positives to begin to appear in the resulting recoveries. One possible avenue of research is to look at file entropy, a good starting point for which could be the work on high entropy file fragments by Penrose, Macfarlane and Buchanan (2013).
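As a starting point for the entropy idea suggested above, the Shannon entropy of a block of bytes can be computed with the standard library alone. This is a generic sketch of the measure itself, not the classification approach of the cited paper, and the function names and the 4096-byte window size are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def window_entropies(data, size=4096):
    """Entropy of each consecutive fixed-size window, e.g. one value per cluster."""
    return [shannon_entropy(data[i:i + size]) for i in range(0, len(data), size)]
```

Compressed audio tends to score close to 8 bits per byte, as do other compressed or encrypted formats, so entropy alone would not be expected to separate MP3 fragments from those. It might still help to discard low-entropy regions before searching for frame headers.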

References

Aziz, A. (2012). Development FPGA-based Mp3 Decoder by using Altera DE2. Retrieved February 19, 2017, from
Beek, C. (2011). Open Security Research: File Carving! Retrieved October 9, 2016, from
Benjamin, A. (2010). Music Compression Algorithms and Why You Should Care. Washington University in St. Louis. Retrieved from dprojects/webpages/su10/alexbenjamin_audiocompression.pdf
Brand, M. (2006). Analysis avoidance techniques of malicious software. Edith Cowan University. Retrieved from
de Nijs, G., Biesheuvel, A., Denissen, A., & Lambert, N. (2006). The effects of filesystem fragmentation. In Proceedings of the Linux Symposium (Vol. 1, pp ).
Edström, B. (2008). blog.bjrn.se: Let's build an MP3-decoder! Retrieved November 4, 2017, from
Garfinkel, S. L. (2007). Carving contiguous and fragmented files with fast object validation. Digital Investigation, 4(SUPPL.),
Grill, B., & Quackenbush, S. (2005). Audio MPEG. Retrieved November 1, 2017, from
id3.org. (2012). mp3frame - ID3.org. Retrieved November 2, 2016, from
Karresand, M., & Shahmehri, N. (2006). Oscar File Type Identification of Binary Data in Disk Clusters and RAM Pages. In Security and Privacy in Dynamic Environments (pp ). Retrieved from _File_Type_Identification_of_Binary_Data_in_Disk_Clusters_and_RAM_Pages
Kessler, G. (2017). File Signatures. Retrieved February 8, 2017, from
Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3,

Maguire, H. (2015). File Carving Mp3 Fragments. University of Abertay Dundee.
National Institute of Standards and Technology. (n.d.). National Software Reference Library. Retrieved February 15, 2017, from
Pahade, R. K., Singh, B., & Singh, U. (2015). A Survey on Multimedia File Carving. International Journal of Computer Science & Engineering Survey, 6(6),
Pal, A., & Memon, N. (2008). The Evolution of File Carving. IEEE Signal Processing Magazine.
Penrose, P., Macfarlane, R., & Buchanan, W. J. (2013). Approaches to the classification of high entropy file fragments. Digital Investigation, 10,
Poisel, R., Tjoa, S., & Tavolato, P. (2011). Advanced file carving approaches for multimedia files. Journal of Wireless Mobile Networks, (August), Retrieved from proaches_for_multimedia_files/file/60b7d5152be7eef31e.pdf
Primeau, E. (2010). Audio Evidence. Retrieved February 8, 2017, from
Raissi, R. (2002). The Theory Behind Mp3. Retrieved from tech.org/programmer/docs/mp3_theory.pdf
Richard, G. G., & Roussev, V. (2006). Next-generation digital forensics. Communications of the ACM, 49(February 2006), 6.
Roussev, V., & Garfinkel, S. L. (2009). File Fragment Classification: The Case for Specialized Approaches. IEEE,
Sammes, A., & Jenkinson, B. (2000). Forensic computing: A practitioner's guide. London: Springer.
Supurovic, P. (1998). MPEG AUDIO FRAME HEADER (mp3 format). Retrieved October 20, 2016, from
US DoJ. (2016). U.S. Authorities Charge Owner of Most-Visited Illegal File-Sharing Website with Copyright Infringement, OPA, Department of Justice. Retrieved October 8, 2016, from

Appendices

Appendix 1 Project Overview

Initial Project Overview

SOC10101 Honours Project (40 Credits)

Title of Project: Identification, Recovery and Verification of MP3 data structures from Obfuscated Files

Overview of Project Content and Milestones

The ability to prove copyright infringement is still a key priority in digital forensic investigations, especially with the recent increase in charges brought against torrent site owners such as Artem Vaulin, of KickAss Torrents. The MP3 format is one of the most common file types found when investigating pirated audio and music files. Most torrent sites have thousands of torrents for music albums, accessed by tens of thousands of downloaders every day, and these files are usually copyright infringements. Finding the digital evidence of these infringements is relatively easy when files are whole and in their default state, even if deleted, but becomes progressively harder to manage when the suspect has used obfuscation on files to prevent them from being discovered as copyrighted material.

I will research the file structure of an MP3 file, and apply this understanding to develop a process I can use to identify the data structure of an MP3 file hidden within obfuscated files, and also to identify files where the data is not in MP3 format. Once I have developed this process, I will create a tool that uses it to identify and recover this data into a file that will be playable for investigators. The tool will also perform piecewise hashing against the recovered file to show what audio file the original file was, the results of which can be used as evidence of copyright infringement.

There have been investigations into similar processes, which tend to look at the frame header structure which appears in MP3 files and use it to recover fragmented MP3 files.
Using this alone as a way of finding the further frames which should appear can often lead to a high number of false positives appearing in the recovered file, causing tracking and scanning issues in the playback of the file due to added frames. I intend my tool to use more of the data available in the frame header to calculate the size of frames on a per-frame basis, with the potential inclusion of the side data available within a frame as a cross-reference where needed to clear out possible false positives from my results. Once the process has managed to locate and recover the MP3 audio, the use of piecewise hashing will allow us to show a comparison between the obfuscated file and a list of known, copyrighted audio files.

The Main Deliverable(s):

The main deliverable of my project will be an application which is able to identify an MP3 file from an otherwise obfuscated file by identifying its data structure within the file, carve the data which makes up the suspected MP3 data structure, and then verify the file by using piecewise hashing to compare the similarity of the recovered data to that of known audio files. Piecewise hashing must be used instead of standard file hashing, as the recovered file may be missing sections that the original file has, due to the obfuscation and to my tool not recovering a file's metadata. The app should also be able to supply an accompanying text file containing relevant metadata about the file.

The Target Audience for the Deliverable(s):

The target audience is forensic investigators looking for files that can be used as evidence of copyright infringement, who would be provided with a method of forensic recovery of obfuscated MP3 files for use in copyright infringement investigations. Other people who may be interested in the results of the project are researchers who are interested in the forensic information contained inside, and recoverable from, the MP3 file structure.

The Work to be Undertaken:

I will begin with in-depth research and investigation into the file structure of an MP3 file, to allow me to identify components other than the standard frame header within the file which can be used for fragment identification and recovery.
I also need to look into existing Python modules which allow me to compare any recovered data against existing known files using context triggered piecewise hashing, for verification.

Once my initial research is complete, I'll need to develop a manual methodology I can use for identifying the MP3 data structure, if present, within an obfuscated file, and for telling a user when the file does not contain MP3 data. I'll also be testing this methodology by identifying this data structure manually, using a hex editor to carve the MP3 data from the obfuscated file into a new file, performing a playback of the recovered file to see how much has been recovered, and using piecewise hashing to see how close to an original file it might be.

Finally, I will develop and create a tool, using the Python scripting language, which can use multiple smaller scripts to automate the identification and recovery of the MP3 data structure from obfuscated files. Final testing will be carried out by running it on a variety of files, including both obfuscated and non-obfuscated MP3 files and also non-MP3 files, then comparing the piecewise hash of the recovered data against known audio files, and playing back any recovered data to hear the results.

Additional Information / Knowledge Required:

An in-depth understanding of the file structure of MP3s will be required. Some of this I have already acquired during my summer internship, where I developed the idea for this project, and I will need to look at it in even more depth to step past the standard usage of the frame headers and include the use of the other available side information found within a file to identify fragments.

I will use Python in the creation of my recovery tool. I consider myself relatively proficient with the language, having recently refined my skill during my internship, and feel it is most likely the best language I could use, as it allows me to develop multiple scripts that can perform the distinct functions needed to automate the process of identification and recovery of the file fragments. I will also need to look at packages already available to allow me to access the MFT.

Information Sources that Provide a Context for the Project:

There is already a lot of research into the file structure of MP3s, although it tends to look at the structure itself rather than the rationale behind that structure.
There have been many studies of carving MP3 files, and a lot of this information is available online in articles published by well-known names in the research area.

Starting references:
Garfinkel, S. and Metz, J. (2007). Carving contiguous and fragmented files with fast object validation.
Raissi, R. (2002). The Theory Behind Mp3.
Supurovic, P. (n.d.). MPEG Audio Frame Header (MP3). Retrieved from

The Importance of the Project: Although the idea behind the project isn't new and has been attempted many times by other people, I feel that my approach is unique: identifying the data structure within obfuscated files, helping to rule out false positives in frame identification, and using context triggered piecewise hashing to verify the recovered file against known audio files together allow investigators to recover files.

The Key Challenge(s) to be Overcome: Most of the past work in this area has returned a lot of false positives during identification. I have a plan in place which I hope will allow me to completely avoid these false positives from the very beginning. This means I will need to devote a lot of time in my research and development phases to ensure that my understanding of the file structure allows my methodology to incorporate these extra functional checks, and to assess what existing Python modules offer for verifying the recovered file against known files. Another challenge I need to consider is the files and file types I will use during my testing phase. I will need to use multiple different file types, including both obfuscated and non-obfuscated MP3 files, in order to show the identification and verification of the recoverable data, and also to show that I can clearly differentiate between MP3 files and other file types.
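
The per-frame size calculation and the false-positive check described in this overview can be sketched as follows. This is a minimal illustration and not the tool developed in this project: it assumes MPEG-1 Layer III frames only, relies on the published 32-bit header layout (11-bit sync, version, layer, bitrate index, sample-rate index, padding bit), and the names `frame_size` and `find_frame_runs` are hypothetical. A candidate header is accepted only when its computed length leads directly to another valid header.

```python
import struct

# MPEG-1 Layer III lookup tables (bitrate index 0 is "free", 15 is invalid).
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES_HZ = [44100, 48000, 32000]  # sample-rate index 3 is reserved


def frame_size(header: bytes):
    """Return the frame length in bytes for an MPEG-1 Layer III header, else None."""
    if len(header) < 4:
        return None
    (word,) = struct.unpack(">I", header[:4])
    if word >> 21 != 0x7FF:            # 11 sync bits must all be set
        return None
    version = (word >> 19) & 0x3       # 3 means MPEG-1
    layer = (word >> 17) & 0x3         # 1 means Layer III
    bitrate_idx = (word >> 12) & 0xF
    rate_idx = (word >> 10) & 0x3
    padding = (word >> 9) & 0x1
    if version != 3 or layer != 1:
        return None
    if bitrate_idx in (0, 15) or rate_idx == 3:
        return None
    bitrate = BITRATES_KBPS[bitrate_idx] * 1000
    return 144 * bitrate // SAMPLE_RATES_HZ[rate_idx] + padding


def find_frame_runs(data: bytes, min_frames: int = 3):
    """Yield (offset, frame_count) for runs of back-to-back valid frames."""
    pos = 0
    while pos + 4 <= len(data):
        count, cursor = 0, pos
        while True:
            size = frame_size(data[cursor:cursor + 4])
            if size is None or cursor + size > len(data):
                break
            count += 1
            cursor += size
        if count >= min_frames:
            yield pos, count
            pos = cursor               # resume after the accepted run
        else:
            pos += 1                   # isolated match: treat as a false positive
```

For a 128 kbps, 44.1 kHz header (`FF FB 90 00`) the computed length is 417 bytes, or 418 with the padding bit set. Requiring several consecutive frames gives up some frames at fragment edges, but it should reject most chance matches on the sync bits alone.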

Appendix 2 Second Formal Review Output

Week 9 Report

Student Name: Paul Flack
Supervisor: Petra Leimich
Second Marker: Sean McKeown
Date of Meeting: 28/11/16

Can the student provide evidence of attending supervision meetings by means of project diary sheets or other equivalent mechanism? yes
If not, please comment on any reasons presented

Please comment on the progress made so far
Paul has made good progress in understanding the MP3 spec and has done most of the literature review. Additionally, some work has been done to explore potential software modules for interacting with the exFAT file system.

Is the progress satisfactory? yes

Can the student articulate their aims and objectives? yes
If yes then please comment on them, otherwise write down your suggestions.
While Paul is slightly behind his plan, he is still in a good position to complete the project. The higher level aims and objectives are clear, in that MP3 file carving is to be conducted by using markers found in the individual MP3 data frames. A general approach to verifying individual disk blocks was also discussed. The details of the particular experiment need work, however. While the general approach is clear, the particular experimental setup needs to be clarified. At the meeting it was decided that Paul needs to work on this for his next supervision meeting. Paul also wants to investigate the properties of MP3 encoding which allow for data from different frames to be encoded across frames.

Does the student have a plan of work? yes
If yes then please comment on that plan otherwise write down your suggestions.
Paul has a Gantt chart and a good idea of when he wants to be finished with the literature review, as well as when the coding period is.

Finishing the literature review and working on pieces of code which will be independent of final experimental decisions is a good idea, and will allow the project to progress while some details are worked out.

Does the student know how they are going to evaluate their work? yes
If yes then please comment otherwise write down your suggestions.
The evaluation will depend on the final experimental setup, which in this case depends on:
o how many MP3s and other files will be on the test image
o status of the file system,
o whether file system metadata is used,
o and whether other modifications (deleted file, partially overwritten, pieces missing) are utilised.
However, some general ideas for evaluating the performance of the recovery system, and how well the file plays back upon recovery, were:
o Listening to it manually
o Comparing individual data blocks with the original file
o Piecewise hashing
o Waveform comparison with the original audio

Any other recommendations as to the future direction of the project
Paul has to clarify the focus of his tool and the experimental criteria, which go hand in hand.
o Is the focus on finding MP3 blocks or reconstructing them?
The file system used is likely unimportant, as long as there isn't any particular file system metadata which is exploited in the recovery process.

Signatures: Supervisor Second Marker Student

Please give the student a photocopy of this form immediately after the review meeting; the original should be lodged in the School Office with Leanne Clyde
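
Two of the evaluation ideas listed in this review, comparing data blocks with the original file and piecewise hashing, can be illustrated together. The sketch below is a simplified stand-in and not context triggered piecewise hashing as implemented by dedicated tools: it picks chunk boundaries from a rolling sum over nearby bytes, hashes each chunk with SHA-256, and reports how many of the recovered file's chunks also occur in the original. The names `content_chunks` and `chunk_similarity`, and the `window` and `modulus` values, are assumptions made for illustration.

```python
import hashlib


def content_chunks(data: bytes, window: int = 7, modulus: int = 64):
    """Split data at boundaries chosen by a rolling sum of the last `window` bytes."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # drop the byte that left the window
        # A boundary depends only on nearby bytes, so it survives shifts
        # caused by data missing from the start of the file.
        if i >= window - 1 and rolling % modulus == modulus - 1:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]                # trailing chunk after the last boundary


def chunk_similarity(recovered: bytes, original: bytes) -> float:
    """Fraction of distinct chunks in `recovered` that also occur in `original`."""
    recovered_digests = {hashlib.sha256(c).hexdigest() for c in content_chunks(recovered)}
    if not recovered_digests:
        return 0.0
    original_digests = {hashlib.sha256(c).hexdigest() for c in content_chunks(original)}
    return len(recovered_digests & original_digests) / len(recovered_digests)
```

Because boundaries are chosen from local content rather than fixed offsets, a recovered file that lacks the original's leading metadata should still share most of its chunks with the original. A comparison of fixed-size blocks would lose alignment in that case.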

Appendix 3 Project Management and Diary Sheets

Gantt Chart Project Timeline

