File Classification Using Sub-Sequence Kernels

Similar documents
Categories of Digital Investigation Analysis Techniques Based On The Computer History Model

The Normalized Compression Distance as a File Fragment Classifier

File Fragment Encoding Classification: An Empirical Approach

Extracting Hidden Messages in Steganographic Images

Ranking Algorithms For Digital Forensic String Search Hits

Developing a New Digital Forensics Curriculum

System for the Proactive, Continuous, and Efficient Collection of Digital Forensic Evidence

Monitoring Access to Shared Memory-Mapped Files

A Novel Approach of Mining Write-Prints for Authorship Attribution in Forensics

Rapid Forensic Imaging of Large Disks with Sifting Collectors

A Correlation Method for Establishing Provenance of Timestamps in Digital Evidence

An Evaluation Platform for Forensic Memory Acquisition Software

Automated Identification of Installed Malicious Android Applications

CAT Detect: A Tool for Detecting Inconsistency in Computer Activity Timelines

A Functional Reference Model of Passive Network Origin Identification

Unification of Digital Evidence from Disparate Sources

Hash Based Disk Imaging Using AFF4

Design Tradeoffs for Developing Fragmented Video Carving Tools

Fast Indexing Strategies for Robust Image Hashes

Breaking the Performance Wall: The Case for Distributed Digital Forensics

A Study of User Data Integrity During Acquisition of Android Devices

On Criteria for Evaluating Similarity Digest Schemes

Privacy-Preserving Forensics

Honeynets and Digital Forensics

A Framework for Attack Patterns Discovery in Honeynet Data

Multidimensional Investigation of Source Port 0 Probing

Designing Robustness and Resilience in Digital Investigation Laboratories

Leveraging CybOX to Standardize Representation and Exchange of Digital Forensic Information

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18

Android Forensics: Automated Data Collection And Reporting From A Mobile Device

Archival Science, Digital Forensics and New Media Art

BinComp: A Stratified Approach to Compiler Provenance Attribution

Can Digital Evidence Endure the Test of Time?

Automatic training example selection for scalable unsupervised record linkage

A Strategy for Testing Hardware Write Block Devices

Robust PDF Table Locator

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

Automated Classification. Lars Marius Garshol Topic Maps

Introduction to Kernels (part II)Application to sequences p.1

Behavioral Data Mining. Lecture 10 Kernel methods and SVMs

CHEMINSTRUMENTS DYNAMIC SHEAR TESTER MODEL DS-1000 OPERATING INSTRUCTIONS

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

Support vector machine (II): non-linear SVM. LING 572 Fei Xia

Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin and Man-Pyo Hong

Coding for Random Projects

Predicting the Types of File Fragments

Code Type Revealing Using Experiments Framework

Supervised Learning Classification Algorithms Comparison

CSE141 Problem Set #4 Solutions

Profile-based String Kernels for Remote Homology Detection and Motif Extraction

12 Classification using Support Vector Machines

Annotated Suffix Trees for Text Clustering

Approach Research of Keyword Extraction Based on Web Pages Document

Text Classification using String Kernels

Classification. 1 o Semestre 2007/2008

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

The Bhopal School of Social Sciences, Bhopal

KBSVM: KMeans-based SVM for Business Intelligence

Surveying The User Space Through User Allocations

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification

K- Nearest Neighbors(KNN) And Predictive Accuracy

SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines

New String Kernels for Biosequence Data

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

Exercise 4. AMTH/CPSC 445a/545a - Fall Semester October 30, 2017

Naïve Bayes for text classification

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

Advanced Data Mining Techniques

Web Service Matchmaking Using Web Search Engine and Machine Learning

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Outline of the course

ESERCITAZIONE PIATTAFORMA WEKA. Croce Danilo Web Mining & Retrieval 2015/2016

Information Assurance In A Distributed Forensic Cluster

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Determinant of homography-matrix-based multiple-object recognition

Topic 1 Classification Alternatives

SVM cont d. Applications face detection [IEEE INTELLIGENT SYSTEMS]

Packet Classification using Support Vector Machines with String Kernels

Machine Learning: Algorithms and Applications Mockup Examination

Classification by Support Vector Machines

Chakra Chennubhotla and David Koes

Adding Source Code Searching Capability to Yioop

2. Design Methodology

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Chapter 3: Important Concepts (3/29/2015)

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

Tutorial on Machine Learning Tools

Support Vector Machines

Robust 1-Norm Soft Margin Smooth Support Vector Machine

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

Transductive Learning: Motivation, Model, Algorithms

Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity

Content Based Image Retrieval system with a combination of Rough Set and Support Vector Machine

6 Model selection and kernels

CS294-1 Final Project. Algorithms Comparison

Automatic Domain Partitioning for Multi-Domain Learning

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

Transcription:

DIGITAL FORENSIC RESEARCH CONFERENCE File Classification Using Sub-Sequence Kernels By Olivier DeVel Presented At The Digital Forensic Research Conference DFRWS 2003 USA Cleveland, OH (Aug 6 th - 8 th ) DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working groups, annual conferences and challenges to help drive the direction of research and development. http:/dfrws.org

1 File Classification Using Sub- Sequence Kernels Olivier de Vel Computer Forensics Group DSTO, Australia

2 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

3 Computer Forensics: Focus on Document Mining Eg, disk surface contents structured data semi-structured data unstructured data

4 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

5 Semi-Structured Documents May or may not include natural language text May be compound documents Include low-level, short-range structures. For example: byte block identifiers mark-up tokens

6 Semi-Structured Documents Traditional text mining techniques generally not appropriate.

7 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

8 Broadband k-spectrum Kernel: Use an extension to the string kernel. String (k-spectrum) Kernel: Is the set of all k-length contiguous subsequences that a sequence x i (of alphabet T, size T =l) contains Feature space dimension is up to T k (eg, if k=4 and T = 256 number of dimensions 10 9 ) X i = BOOKOOPERPPO (k=4) BOOK OOKO OKOO KOOP OOPE etc. Vector representation: ( 1 (x i ), 1 (x i ),, 1 (x i )) where 1, 2,, l T k

9 Broadband k-spectrum Kernel: Broadband k-spectrum Kernel: BOOKOOPERPPO Is the set of all (0, 1,..k)-length contiguous subsequences that a sequence (of alphabet T) contains Feature space dimension is even larger! (k=4) B, BO, BOO, BOOK O, OO, OOK, OOKO O, OK, OKO, OKOO K, KO, KOO, KOOP O, OO, OOP, OOPE etc.

10 Broadband k-spectrum Kernel: Matrix representation, (BB) (x i ) : ( 1 (1)(x i ) 2 (1)(x i ) l (1)(x i ), 1 (2)(x i ) 2 (2)(x i ) l (2)(x i ), : : : 1 (2)(x i ) 2 (2)(x i ) l (2)(x i )) where i (j) T j for i = 1, 2,..l

11 Broadband k-spectrum Kernel: What s the Big Deal? This allows us to define the inner product between two input sequences, x r and x s as: GramMatrix(x r, x s ) (BB) (x r ) (BB) (x s ) and use the kernel trick when maximising the margin of separation between two classes. The decision function for the SVM learning algorithm is: f(x) = sign( I y i [ (BB) (x r ) (BB) (x s )] + b) in the transformed feature space rather than the input feature space.

12 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

13 Coloured Generalized Suffix Tree (CGST): The CGST is a labelling of the GST representation of a sequence of symbols: It stores a count vector to be stored at each node in the GST each vector element value equals the frequency of the subsequence in the sequence. BOOKOOPERPPO PPEKOOBOKKOO (k=4) B O K K (1,1) (1,1) (0,1) O (1,0) E K O (1,1) (0,1) R (0,1) P (1,0) (1,0) (1,0) K K O (1,3) (0,1) (0,1) (1,2) (0,1) K (1,0) O (0,1) P O O O B (1,2) (0,1) (0,1) P (1,0)

14 CGST: Computing the Kernel (Gram) Matrix For N input sequences, the NxN Gram matrix G is computed as follows: initialize G r,s = 0 x r and x s traverse the CGST at each node, compute the product of each vector element pair and sum to G r,s

15 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

16 Experimental Methodology: Corpus Document corpus for a controlled experiment : 5 document classes (MSWord, JPEG, MSExcel, Java source, Adobe PDF) 465 documents (0.11KB min. to 523.3KB max. size) data are scaled to unit standard deviation and zero mean

17 Experimental Methodology: Classifier SVM light as the (two-way) classifier, Obtain two-way categorisation matrix for each document type category, using 10-fold cross-validation sampling, Calculate per-document-type category classification performance statistics precision, recall and F 1 statistics.

18 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

19 Experiments: Performance Results Classification performance was evaluated using the following parameters: Minimum and maximum sub-sequence window offset lengths,

20 Experiments: Performance Results Minimum and Maximum Sub-sequence Window Offset Length: Definition: For example; sub-sequence window (length = 5) MinSeqLength (=2) MaxSeqLength (=4)

21 Experiments: Performance Results Minimum Sub-sequence Offset Length, MinSeqLength: Optimal MinSeqLength=1 100 90 80 Use broadband spectrum 70 F1 Performance (%) 60 50 40 PDF Faster drop-off for non-text 30 20 10 0 1 2 3 4 Minimum Sub-sequence Offset Length 5 Java MSExcel documents (except JPEG) Document Type JPEG MSWord [Parameters: MaxStringLength=256 MaxSeqLength=5 MinSeqCount=10]

22 Experiments: Performance Results Maximum Sub-sequence Offset Length, MaxSeqLength: 100 90 80 70 60 Optimal value is 4<MaxSeqLength<7 F1 Performance (%) 50 40 30 No significant improvements 20 10 0 1 2 3 4 5 8 Maximum Sub-sequence Offset Length 10 15 JPEG MSWord PDF Java MSExcel Document Type for large sub-sequence offset values

23 Conclusions: Promising semi-structured document categorization results using the broadband k-spectrum kernel. Experiments suggest: use broadband kernels rather than narrow band kernels (ie fixed k-length sequence), document length can be small, use a minimum suffix count to achieve improved classification performance and tree compression. Extensions: Increase the number of document types Investigate techniques for capturing longer-range data structures in documents

24 Questions? Advertisement: Computer and Intrusion Forensics by G. Mohay, A. Anderson, B.Collie, O. de Vel and R. McKemmish Artech House ISBN 1-58053-369-8 (2003)