File Classification Using Sub-Sequence Kernels

DIGITAL FORENSIC RESEARCH CONFERENCE File Classification Using Sub-Sequence Kernels By Olivier DeVel Presented At The Digital Forensic Research Conference DFRWS 2003 USA Cleveland, OH (Aug 6 th - 8 th ) DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working groups, annual conferences and challenges to help drive the direction of research and development. http:/dfrws.org

1 File Classification Using Sub- Sequence Kernels Olivier de Vel Computer Forensics Group DSTO, Australia

2 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

3 Computer Forensics: Focus on Document Mining Eg, disk surface contents structured data semi-structured data unstructured data

4 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

5 Semi-Structured Documents May or may not include natural language text May be compound documents Include low-level, short-range structures. For example: byte block identifiers mark-up tokens

6 Semi-Structured Documents Traditional text mining techniques generally not appropriate.

7 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

8 Broadband k-spectrum Kernel: Use an extension to the string kernel. String (k-spectrum) Kernel: Is the set of all k-length contiguous subsequences that a sequence x i (of alphabet T, size T =l) contains Feature space dimension is up to T k (eg, if k=4 and T = 256 number of dimensions 10 9 ) X i = BOOKOOPERPPO (k=4) BOOK OOKO OKOO KOOP OOPE etc. Vector representation: ( 1 (x i ), 1 (x i ),, 1 (x i )) where 1, 2,, l T k

9 Broadband k-spectrum Kernel: Broadband k-spectrum Kernel: BOOKOOPERPPO Is the set of all (0, 1,..k)-length contiguous subsequences that a sequence (of alphabet T) contains Feature space dimension is even larger! (k=4) B, BO, BOO, BOOK O, OO, OOK, OOKO O, OK, OKO, OKOO K, KO, KOO, KOOP O, OO, OOP, OOPE etc.

10 Broadband k-spectrum Kernel: Matrix representation, (BB) (x i ) : ( 1 (1)(x i ) 2 (1)(x i ) l (1)(x i ), 1 (2)(x i ) 2 (2)(x i ) l (2)(x i ), : : : 1 (2)(x i ) 2 (2)(x i ) l (2)(x i )) where i (j) T j for i = 1, 2,..l

11 Broadband k-spectrum Kernel: What s the Big Deal? This allows us to define the inner product between two input sequences, x r and x s as: GramMatrix(x r, x s ) (BB) (x r ) (BB) (x s ) and use the kernel trick when maximising the margin of separation between two classes. The decision function for the SVM learning algorithm is: f(x) = sign( I y i [ (BB) (x r ) (BB) (x s )] + b) in the transformed feature space rather than the input feature space.

12 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

13 Coloured Generalized Suffix Tree (CGST): The CGST is a labelling of the GST representation of a sequence of symbols: It stores a count vector to be stored at each node in the GST each vector element value equals the frequency of the subsequence in the sequence. BOOKOOPERPPO PPEKOOBOKKOO (k=4) B O K K (1,1) (1,1) (0,1) O (1,0) E K O (1,1) (0,1) R (0,1) P (1,0) (1,0) (1,0) K K O (1,3) (0,1) (0,1) (1,2) (0,1) K (1,0) O (0,1) P O O O B (1,2) (0,1) (0,1) P (1,0)

14 CGST: Computing the Kernel (Gram) Matrix For N input sequences, the NxN Gram matrix G is computed as follows: initialize G r,s = 0 x r and x s traverse the CGST at each node, compute the product of each vector element pair and sum to G r,s

15 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

16 Experimental Methodology: Corpus Document corpus for a controlled experiment : 5 document classes (MSWord, JPEG, MSExcel, Java source, Adobe PDF) 465 documents (0.11KB min. to 523.3KB max. size) data are scaled to unit standard deviation and zero mean

17 Experimental Methodology: Classifier SVM light as the (two-way) classifier, Obtain two-way categorisation matrix for each document type category, using 10-fold cross-validation sampling, Calculate per-document-type category classification performance statistics precision, recall and F 1 statistics.

18 Presentation: Document mining and computer forensics Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix tree representation Experimental methodology Results and Conclusion

19 Experiments: Performance Results Classification performance was evaluated using the following parameters: Minimum and maximum sub-sequence window offset lengths,

20 Experiments: Performance Results Minimum and Maximum Sub-sequence Window Offset Length: Definition: For example; sub-sequence window (length = 5) MinSeqLength (=2) MaxSeqLength (=4)

21 Experiments: Performance Results Minimum Sub-sequence Offset Length, MinSeqLength: Optimal MinSeqLength=1 100 90 80 Use broadband spectrum 70 F1 Performance (%) 60 50 40 PDF Faster drop-off for non-text 30 20 10 0 1 2 3 4 Minimum Sub-sequence Offset Length 5 Java MSExcel documents (except JPEG) Document Type JPEG MSWord [Parameters: MaxStringLength=256 MaxSeqLength=5 MinSeqCount=10]

22 Experiments: Performance Results Maximum Sub-sequence Offset Length, MaxSeqLength: 100 90 80 70 60 Optimal value is 4<MaxSeqLength<7 F1 Performance (%) 50 40 30 No significant improvements 20 10 0 1 2 3 4 5 8 Maximum Sub-sequence Offset Length 10 15 JPEG MSWord PDF Java MSExcel Document Type for large sub-sequence offset values

23 Conclusions: Promising semi-structured document categorization results using the broadband k-spectrum kernel. Experiments suggest: use broadband kernels rather than narrow band kernels (ie fixed k-length sequence), document length can be small, use a minimum suffix count to achieve improved classification performance and tree compression. Extensions: Increase the number of document types Investigate techniques for capturing longer-range data structures in documents

24 Questions? Advertisement: Computer and Intrusion Forensics by G. Mohay, A. Anderson, B.Collie, O. de Vel and R. McKemmish Artech House ISBN 1-58053-369-8 (2003)