Predicting the Types of File Fragments

Similar documents
A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification

Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin and Man-Pyo Hong

Statistical Disk Cluster Classification for File Carving

BYTE FREQUENCY ANALYSIS DESCRIPTOR WITH SPATIAL INFORMATION FOR FILE FRAGMENT CLASSIFICATION

On Improving the Accuracy and Performance of Content-Based File Type Identification

Code Type Revealing Using Experiments Framework

File-type Detection Using Naïve Bayes and n-gram Analysis. John Daniel Evensen Sindre Lindahl Morten Goodwin

Introduction to carving File fragmentation Object validation Carving methods Conclusion

The Normalized Compression Distance as a File Fragment Classifier

File Fragment Encoding Classification: An Empirical Approach

Classification of packet contents for malware detection

Automated Data Type Identification And Localization Using Statistical Analysis Data Identification

Novel Approaches To The Classification Of High Entropy File Fragments

Random Sampling with Sector Identification

Supervised vs unsupervised clustering

Introduction. Collecting, Searching and Sorting evidence. File Storage

A REVIEW OF DIGITAL FORENSICS METHODS FOR JPEG FILE CARVING

Introduction Mapping the ELF Pinpointing Fragmentation Evaluation Conclusion. Bin-Carver. Automatic Recovery of Binary Executable Files

File Carving Using Sequential Hypothesis Testing

Feature-based Type Identification of File Fragments

Using purpose-built functions and block hashes to enable small block and sub-file forensics

Data Mining: STATISTICA

File Type Identification - Computational Intelligence for Digital Forensics

FRAME BASED RECOVERY OF CORRUPTED VIDEO FILES

Data Compression Fundamentals

Outline. Prepare the data Classification and regression Clustering Association rules Graphic user interface

VC 12/13 T16 Video Compression

Design Tradeoffs for Developing Fragmented Video Carving Tools

Comparison of Classification Algorithms for File Type Detection A Digital Forensics Perspective

Ranking Algorithms For Digital Forensic String Search Hits

RAPID RECOGNITION OF BLACKLISTED FILES AND FRAGMENTS MICHAEL MCCARRIN BRUCE ALLEN

GCSE Computing. Revision Pack TWO. Data Representation Questions. Name: /113. Attempt One % Attempt Two % Attempt Three %

Digital Media. Daniel Fuller ITEC 2110

CS570: Introduction to Data Mining

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:

Identification and recovery of JPEG files with missing fragments

File Fragment Classification Using Neural Networks with Lossless Representations

ACCELERATION OF DIGITAL FORENSICS FUNCTIONS USING A GPU. A Project. Presented to the faculty of the Department of Computer Science

Block Mean Value Based Image Perceptual Hashing for Content Identification

ManTech SMA. Computer Forensics and Intrusion Analysis. Fuzzy Hashing. Jesse Kornblum

David Rappaport School of Computing Queen s University CANADA. Copyright, 1996 Dale Carnegie & Associates, Inc.

Dissecting Files. Endianness. So Many Bytes. Big Endian vs. Little Endian. Example Number. The "proper" order of things. Week 6

CS/COE 1501

CS/COE 1501

Unicode. Standard Alphanumeric Formats. Unicode Version 2.1 BCD ASCII EBCDIC

A SURVEY ON MULTIMEDIA FILE CARVING

Audio Engineering Society. Conference Paper. Presented at the Conference on Audio Forensics 2017 June Arlington, VA, USA

JPEG: An Image Compression System. Nimrod Peleg update: Nov. 2003

Media Types. Web Architecture and Information Management [./] Spring 2009 INFO (CCN 42509) Contents. Erik Wilde, UC Berkeley School of

A Survey on Feature Extraction Techniques for Palmprint Identification

Steganography: Hiding Data In Plain Sight. Ryan Gibson

Technical Support Minitab Version Student Free technical support for eligible products

Authentication and Secret Message Transmission Technique Using Discrete Fourier Transformation

Temperature Calculation of Pellet Rotary Kiln Based on Texture

Accelerated Machine Learning Algorithms in Python

CEIC 2007 May 8, 2007

Network Traffic Measurements and Analysis

Topic Data carving, as defined by Digital Forensic Research Workshop is the process of

Business Club. Decision Trees

JPEG: An Image Compression System

255, 255, 0 0, 255, 255 XHTML:

Content Based File Type Detection Algorithms

Experiments in Mathematical Web Animation

IMAGE COMPRESSION USING FOURIER TRANSFORMS

Digital Image Processing

Multimedia Networking ECE 599

Fingerprint Image Compression

Model-Driven Engineering in Digital Forensics. Jeroen van den Bos with Tijs van der Storm and Leon Aronson

Common File Formats. Need a standard to store images Raster data Photos Synthetic renderings. Vector Graphic Illustrations Fonts

Fundamentals of Video Compression. Video Compression

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

This is not yellow. Image Files - Center for Graphics and Geometric Computing, Technion 2

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

IMAGE COMPRESSION TECHNIQUES

Publication Quality Graphics

Logo & Icon. Fit Together Logo (color) Transome Logo (black and white) Quick Reference Print Specifications

Machine Learning Techniques for Data Mining

Image creation with PHP

Image Classification for JPEG Compression

G64PMM - Lecture 3.2. Analogue vs Digital. Analogue Media. Graphics & Still Image Representation

Module 6 STILL IMAGE COMPRESSION STANDARDS

STATISTICS (STAT) Statistics (STAT) 1

Minitab 18 Feature List

MINITAB Release Comparison Chart Release 14, Release 13, and Student Versions

Classification of Protein Crystallization Imagery

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

4.3 The Normal Distribution

Nonparametric Importance Sampling for Big Data

Standard File Formats

Computer forensic science

Lecture Information. Mod 01 Part 1: The Need for Compression. Why Digital Signal Coding? (1)

I211: Information infrastructure II

Lecture Information Multimedia Video Coding & Architectures

M4-R4: INTRODUCTION TO MULTIMEDIA (JAN 2019) DURATION: 03 Hrs

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

Volume 2, Issue 9, September 2014 ISSN

Classification by Nearest Shrunken Centroids and Support Vector Machines

File Systems and Volumes

File Classification Using Sub-Sequence Kernels

Transcription:

Predicting the Types of File Fragments William C. Calhoun and Drue Coles Department of Mathematics, Computer Science and Statistics Bloomsburg, University of Pennsylvania Bloomsburg, PA 17815

Thanks to David Reichert (student) Bloomsburg U. Stephen Rhein (student) Bloomsburg U. Ron Bosch (statistician) Harvard S. of P.H. Reza Noubary (statistician) Bloomsburg U. David Bennet (statistician) Bloomsburg U. The DFRWS referees

The Problem To identify the types of fragments of computer files (without headers and footers). File type prediction software could be used to quickly find fragments for further investigation and to improve the performance of file-carvers, virus scanners, firewalls and search engines.

File Carvers The file header and footer can be used to identify the type of the file and to locate the first and last fragment. If the file was stored contiguously and was not overwritten, file carving programs can easily recover the whole file. File fragmentation can confuse file carvers.

File fragmentation can confuse file carvers

Our Goal To develop effective methods to identify the type of a file fragment automatically. The method must work even on file fragments from the middle of the file with no header or footer.

McDaniel and Heydari (2003) Used file fingerprints (byte frequency distribution), byte frequency correlation and file headers and trailers. The best results use file headers and trailers.

Li, Wang, Stolfo and Herzog (2005) Used fileprints -- the histograms of byte frequencies and variance. Centroid fileprints are found for each type of file. File types are predicted by finding the nearest centroid fileprint. The file header is included in their tests.

From File Type Identification of File Fragments by Their Binary Structure by Karresand and Shahmehri

Karresand and Shahmehri (2006) The Oscar method also uses centroid fileprints, but a different metric. Also uses the rate of change of byte values. Optimized for JPG files and uses special information about JPG file structure. Works very well in this context (99.2% detection with no false positives.)

Veenman(2007) Applies Fisher s linear discriminant to the fileprint, entropy and Kolmogorov complexity. Similar to our approach, but we used other statistics and the longest common subsequence algorithm. It is difficult to compare performance since the tests were not the same.

Our Approaches 1. Use Fisher s classification theory from statistics. 2. Use the longest common subsequence algorithm.

Regression: Finding the best curve to fit the data

Classification Theory Similar to regression. Provides a model that fits the data as well as possible. Classification theory becomes computationally infeasible with too many variables. Statistics we used: mean, modes, standard deviation, entropy, correlation between successive bytes and the frequencies of special codes such as ASCII codes. ( We have also used a genetic algorithm to search for a good model.)

The linear discriminant projects the data on a line

Longest Common Substring X = DABCABCA Y = ACABADAB

Longest Common Substring X = DABCABCA Y = ACABADAB

Longest Common Subsequence X = DABCABCA Y = ACABADAB

Longest Common Subsequence X = DABCABCA Y = ACABADAB There is an obvious way to compute the LCS of two strings. But the time-complexity is O(n2 m ) for two strings of length m and n. If the strings have 100 characters it would take millions of years to find the LCS. Using dynamic programming the LCS can be computed in time roughly mn.

The Test Generate a model for predicting file type using 50 sample files of type 1 (good files) and 50 sample files of type 2 (bad files). Test the model on 50 good mystery files and 50 bad mystery files. See how often the predictions were correct.

Some File Types We Used Jpeg - Graphics file. Fourier-type transform applied, low amplitude waves removed (lossy compression), compressed. Gif - Graphics file. Lossless compression. Bitmap Graphics file. No compression Pdf Code for text and graphics. May be compressed Excel - Spreadsheet data. Text and numbers. Random - random sequence of bytes Executable - Machine code for a program.

The best statistics in our experiments A-E-L-H-MF-SF: Ascii, Entropy, Low, High, ModesFreq (Sum of the top 4 frequencies), SdFreq (Sd of frequencies). LongerSSeqA: The file type is predicted to be type A if the average length of the longest common subsequence with type A sample files is longer than with type B sample files. (LongerSSeqA is accurate, but slower then the linear discriminant.)

LCS Most JPG fragments have longer Average LCS with JPGs Average "Longest Common Subsequence" of JPG Frags vs. Test Files (offset 500) 1400 1200 1000 800 600 400 JPG RND GIF PDF 200 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 JPG Fragments

How well did it work? (header lost) offset=128, fragment length=896 Types Rate correct A-E-L-H-MF-SF Rate Correct LongerSSeqA JPG v. PDF.98.94 JPG v. GIF.93.87 JPG v. BMP.85.74 PDF v. GIF.89.92 PDF v. BMP.82.79 GIF v. BMP.83.81

How well did it work? (sector lost) Offset=512 fragment length=512 Types Rate correct A-E-L-H-MF-SF Rate correct LongerSseqA JPG v. PDF.80.91 JPG v. GIF.79.85 JPG v. BMP.87.75 PDF v. GIF.84.90 PDF v. BMP.90.92 GIF v. BMP.85.83

Other Experiments Used longest substring common to all files of a given type (with David Reichert). Works well for finding headers but not well for fragments without headers. The Longest Common Subsequence method appears to work well in tests with 4 file types and with randomly corrupted files

To do list Test the linear discriminant on multiple file types. Use frequencies of two-bytes (or longer) Use a genetic algorithm or Analysis of Variance to determine the best combination of statistics to use. Try the quadratic discriminant. Use JAMA for matrix inversion.

HEX: The Collegiate Journal of Computer Forensics Call for Papers A new peer-reviewed on-line journal for undergraduate computer forensics students and their instructors. The first issue will appear in Fall 2008. Submissions in the form of Word or.pdf files may be sent by e-mail to me: wcalhoun@bloomu.edu subject line.) (use HEX in the

Questions? THANK YOU!