Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results

Similar documents
Ranking Algorithms For Digital Forensic String Search Hits

A New Measure of the Cluster Hypothesis

Chapter 27 Introduction to Information Retrieval and Web Search

A Review on Cluster Based Approach in Data Mining

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Hierarchical Document Clustering

BIG DATA ANALYTICS IN FORENSIC AUDIT. Presented in Mombasa. Uphold public interest

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Semantic Website Clustering

Information Retrieval

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

Lecture 7: Relevance Feedback and Query Expansion

University of Florida CISE department Gator Engineering. Clustering Part 5

Boosting Algorithms for Parallel and Distributed Learning

Notebook paper: TNO instance search submission 2012

CHAPTER 4: CLUSTER ANALYSIS

Chapter 6: Information Retrieval and Web Search. An introduction

Course 832 EC-Council Computer Hacking Forensic Investigator (CHFI)

60-538: Information Retrieval

Digital Forensics Lecture 01- Disk Forensics

Session 10: Information Retrieval

Search engines. Børge Svingen Chief Technology Officer, Open AdExchange

Ed Ferrara, MSIA, CISSP

Cluster Analysis for Microarray Data

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Approximate Search and Data Reduction Algorithms

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Clustering: Overview and K-means algorithm

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

Modelling Structures in Data Mining Techniques

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Search Engines. Information Retrieval in Practice

Clustering Part 3. Hierarchical Clustering

Evaluating a Visual Information Retrieval Interface: AspInquery at TREC-6

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Progress Report: Collaborative Filtering Using Bregman Co-clustering

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Using Clusters on the Vivisimo Web Search Engine

A Text Retrieval Approach to Recover Links among s and Source Code Classes

Mining Distributed Frequent Itemset with Hadoop

Identifying Important Communications

CS 664 Segmentation. Daniel Huttenlocher

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies

CS-E3200 Discrete Models and Search

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

CS Machine Learning

Introduction to Information Retrieval

Evaluation and Design Issues of Nordic DC Metadata Creation Tool

Clustering: Overview and K-means algorithm

Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points

Automatic basis selection for RBF networks using Stein s unbiased risk estimator

Contents. Preface to the Second Edition

2

C HFI C HFI. EC-Council. EC-Council. Computer Hacking Forensic Investigator. Computer. Computer. Hacking Forensic INVESTIGATOR

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

CSN08101 Digital Forensics. Module Leader: Dr Gordon Russell Lecturers: Robert Ludwiniak

Image Segmentation. Ross Whitaker SCI Institute, School of Computing University of Utah

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Mining Web Data. Lijun Zhang

Tansu Alpcan C. Bauckhage S. Agarwal

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

MATRIX BASED SEQUENTIAL INDEXING TECHNIQUE FOR VIDEO DATA MINING

COMP 465: Data Mining Still More on Clustering

A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

21. Search Models and UIs for IR

Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure

TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods

Improving Difficult Queries by Leveraging Clusters in Term Graph

Using the Kolmogorov-Smirnov Test for Image Segmentation

Document Clustering for Mediated Information Access The WebCluster Project

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Introduction to Volume Analysis, Part I: Foundations, The Sleuth Kit and Autopsy. Digital Forensics Course* Leonardo A. Martucci *based on the book:

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Lesson 3. Prof. Enza Messina

Explore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan

OHLONE COLLEGE Ohlone Community College District OFFICIAL COURSE OUTLINE

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

A Deep Relevance Matching Model for Ad-hoc Retrieval

[Raghuvanshi* et al., 5(8): August, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

MFP: The Mobile Forensic Platform

Clustering Algorithms for general similarity measures

Multimedia Information Systems

model order p weights The solution to this optimization problem is obtained by solving the linear system

Network visualization techniques and evaluation

Query-Sensitive Similarity Measure for Content-Based Image Retrieval

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

CS105 Introduction to Information Retrieval

Full-Text Indexing For Heritrix

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Single Particle Reconstruction Techniques

An Enhanced K-Medoid Clustering Algorithm

International ejournals

Transcription:

Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results DFRWS 2007 Department of Information Systems & Technology Management The University of Texas at San Antonio By Nicole L. Beebe & Jan G. Clark, Ph.D. August 13, 2007

Discussion Agenda Background Proposed Approach Experimental Methodology Data Analysis & Results Conclusion 2

Background

Digital Forensic Text String Search Searches evidence for text strings Words, email addresses, numbers, etc. Current tools use literal search techniques String matching algorithms Search hits grouped by search string Often ordered by file item and physical location Full text indexing & Boolean queries Search hits grouped by query Often ordered by file item and physical location 4

Disadvantage Analytically burdensome Hundreds of thousands of hits (or more) 80%-90% of hits are not relevant Result: High IR overhead IR overhead is any time spent doing things other than reviewing relevant search hits Query generation time Search execution time Time spent reviewing non-relevant search hits Current grouping & ordering techniques do not appreciably lessen IR overhead 5

Research Question Can IR & text mining algorithms be extended to digital forensic text string searching? 6

Extension Challenges Traditional Contexts DFTSS* Context Many, many searches Data sets grow incrementally Logical level data only Relatively homogeneous & structured data types High-end search engine platforms Few searches Data sets are often unique to each case Logical & physical level Very heterogeneous & less structured data types Relatively low-end search engine platforms *DFTSS = Digital Forensic Text String Search 7

Research Question To what extent does the extension of IR & text mining algorithms improve IR effectiveness of digital forensic text string searching? 8

IR Effectiveness IR effectiveness in DFTSS context At or near 100% recall Reasonable computational expense Provides usual hit metadata Minimizes IR overhead Keep search execution time reasonable Minimize time spent reviewing non-relevant search hits 9

Research Purpose Develop a better DFTSS process Extend IR & text mining algorithms Evaluate IR effectiveness of new process Build software tool Compare against current processes String match algorithm approach (EnCase ) Indexing / Boolean-based approach (FTK ) CAVEAT: This research is not a tool evaluation/comparison per se. It is meant to consider different/better ways to present text string search output. This approach could be added on to many digital forensic tools. The researcher is a happy consumer of both commercial tools listed! 10

Hypotheses New process outperforms current process WRT Query precision rates Query recall rates Overall process time Increased computer processing time eclipsed by savings in human analytical time Goal is to improve query precision & recall rates Get to the investigatively relevant hits more quickly A results presentation issue; not a fundamental change in the manner of the search 11

Proposed Approach New Text String Search Process

Post-Retrieval Text Clustering Post-retrieval thematic clustering of search hits Unsupervised text mining approach Can be computationally efficient Improves IR effectiveness due to Cluster Hypothesis (van Rijsbergen 1979) Computationally similar (clustered) documents tend to be relevant to the same query Top performing cluster contains 50% of relevant hits (Hearst & Pedersen 1996) Outperforms traditional ranked lists Hearst & Pedersen 1996; Leouski & Croft 1996; Leuski & Allan 2000; Leuski 2001; Leuski & Allen 2004 13

Text Clustering Algorithms Qualities Algorithms Low Computational Expense Good Cluster Quality Can Handle Noisy Data Insensitivity to Input Order Partitioning X Hierarchical X X X Density-based X X X Grid-based X X X Model-based VARIES X X X 14

Self-Organizing NNets Qualities of self-organizing neural net (NNet) Unsupervised machine learning method Model-based clustering algorithm Not computationally expensive Demonstrated success in clustering data (text & non-text) Kohonen Self-Organizing Maps (1981) Benefits Low dimensional output (2-D) Computationally efficient O(n) to O(log(n)) Able to cluster textual & non-textual data 15

New Process Computer Info Processing (CIP) Human Info Processing (HIP) Data CIP-1 Info CIP-2 CIP-3 HIP-1 Know. Raw Hits Document Vectors Document Clusters CIP-1: String Search CIP-2: Prepare Data for Clustering CIP-3: Cluster Docs via Self-Organizing Neural Network HIP-1: Manual Review & Analysis of Hits Increased CIP time should be eclipsed by HIP time savings. Shift more of the analytical burden to the computer 16

Experimental Methodology

Overview of Methodology Instantiate new process in prototype s/w tool Test hypotheses Execute same search against same digital evidence using current & new processes Measure IR effectiveness each time Compare measures 18

Software Development S/W Tool developed was not an all-in-one tool Series of s/w tools, scripts, & data handling procedures Name for interface and general process: Grouper CIP-1: String search Functionality Locate all instances of text strings Software development Modified open source digital forensics tools The Sleuth Kit (TSK) (C) Autopsy (TSK s web-based interface) (Perl) 19

Software Development (cont.) CIP-2: Data preparation for clustering Functionality Identify document vocabulary Extract all alphanumeric strings Select a reduced dimension vocabulary Apply stop word list (McCallum, Bow library) Apply stemming algorithm (Porter, 1980) Produce document vectors Software development Series of home-grown programs & scripts (C, Perl) Porter s open source stemmer (C) 20

Software Development (cont.) CIP-3: Clustering Functionality Thematically cluster documents Software development Selected Scalable SOM algorithm (Roussinov & Chen 1998) Uses binary document vectors & sparse matrix manipulation Much more computationally efficient that traditional SOMs Code (C++) provided by Dr. Dmitri Roussinov, Ariz. State Minor modifications made (debugging & output reformulation) 21

Software Development (cont.) HIP-1: Search result analysis Functionality Facilitate review of clustered search hits Thematically clustered documents Presents documents in priority order (similarity to cluster) Presents search hits in order of physical location Record key variables for IR effectiveness measures Relevancy determinations, search hit review order, date/time stamps of user activity Software development Access database w/ Access programming & VB code 22

Hypotheses Testing Basic approach Execute same search against same digital evidence using current & new processes Measure IR effectiveness each time Compare measures Test data sets Real-world case (private forensics company) Divorce case; allegations of extramarital activity; 40GB HD Mock case (created by graduate students) Murder case; allegation that wife caused husband s heart attack; 10GB HD (previously used & not wiped) 23

Hypotheses Testing (cont.) IR effectiveness Measures Query precision (accuracy) Query recall (completeness) Average precision (search engine performance score) Time (computer & human info processing time) Measurement points Incremental cut-off points 10% increments of # hits reviewed Satisficing point (Simon 1947) When elements of proof are satisfied, and/or When all key textual artifacts have been located 24

Data Analysis & Results Real-World Divorce Case Mock Murder Case

Real-World Case Results 17 search strings yielded ~25,000 search hits Query precision at incremental cut-off points Investigative Precision 100% 90% 80% 70% Precision Rate 60% 50% 40% 30% EnCase FTK Grouper 20% 10% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Cut-Off Points (% Hits Reviewed) 26

Real-World Case Results (cont.) Query recall at incremental cut-off points Investigative Recall 100% 90% 80% 70% Recall Rate 60% 50% 40% EnCase FTK Grouper 30% 20% 10% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Cut-Off Points (% Hits Reviewed) 27

Real-World Case Results (cont.) Average precision scores Score=0 : All non-relevant hits presented first Score=1 : All relevant hits presented first Tool AvgP Score EnCase 0.432 FTK 0.483 Grouper 0.781 Conclusion Post-retrieval clustering of search hits improves query precision & recall rate curves 28

Real-World Case Results (cont.) Process time Additional computer info processing time: <20 min. CIP-2 = 18.1 minutes (data preparation for clustering) CIP-3 = 8 seconds (clustering step) Savings in human analytical time observed Human Information Processing Time Time (min.) 450 400 350 300 250 200 150 100 50 0 EnCase FTK Grouper Conclusion: Post-retrieval clustering of search hits improves overall process time 10% 20% 30% 40% 50% 60% 70% 80% % Recall Desired 90% 100% 29

Real-World Case Results (cont.) Satisficing analysis Motivation Theory of administrative behavior (Simon 1947) Investigators seldom review all search hits Due to resource constraints and fatigue effects Satisficing point determination Clustered output Subjectively determined by research volunteer Current processes Objectively determined by locating same digital artifacts in EnCase & FTK output as recovered up to satisficing point in clustered output 30

Real-World Case Results (cont.) Satisficing analysis results: EnCase FTK Grouper Satisficing Cut-Off Point 99.15% 99.08% 21.99% Precision 60.9% 66.3% 85.0% Recall 98.3% 98.5% 26.4% HIP-1 Time 388.4 min. (6.47 hrs) 366.8 min. (6.11 hrs) 62.1 min. (1.04 hrs) Note: Satisficing points will vary between cases. 31

Mock Murder Case Results 19 search strings yielded ~110,000 search hits EnCase & FTK outperformed Grouper WRT query precision & recall at most cut-off points AvgP scores: EnCase & FTK ~ 0.35 Grouper ~ 0.09 Cluster quality statistics provide explanation Cluster #48 (7x7 map) Huge cluster containing vast majority of relevant hits Very heterogeneous cluster content Non-relevant hits generally presented first in this cluster Map size is too small for search hit result set!! 32

Mock Murder Case Results (cont.) Tested explanation for poor IR effectiveness (SOM size too small) Re-clustered documents into 20x10 map Saw improved cluster quality statistics Cluster #48 from 7x7 map split into 6 clusters (#181 & others) Cluster #181 in 20x10 map still largest cluster, but Noise level reduced Improved cluster content homogeneity Relevant hits now presented earlier 33

Murder Case Results (cont.) Simulated satisficing analysis results EnCase FTK Grouper (Map Size: 7x7) Grouper (Map Size: 20x10) Satisficing Cut-Off Point 58.5% 58.4% 75.9% 32.6% Precision 15.0% 15.7% 10.2% 31.2% Recall 93.7% 94.2% 65.3% 85.4% HIP-1 Time 1,457.4 min. (24.29 hrs) 1,394.2 min. (23.24 hrs) 1,544.46 min. (25.74 hrs) 663.4 min. (11.06 hrs) 34

Conclusion

Study Conclusions Extension of scalable SOMs is feasible Works on unique nature of DFTSS results Clustered search hits can improve IR effectiveness relative to precision & recall Empirical results from real-world case support claim >80% decrease in human analytical time <20 minutes additional computer processing time Empirical results from mock case do not, BUT Predominantly because of insufficient SOM granularity Simulated satisficing analysis results suggest larger map would support hypotheses regarding precision & recall 36

Study Conclusions (cont.) Clustering search hits can improve IR effectiveness relative to overall process time Not cost prohibitive relative to clustering computer information processing time (CIP-2 & CIP-3) Clustering can save human info processing time (more time than increased CIP-2 & CIP-3 time) 37

Limitations Generalizability Only two cases studied Problems with real-world data set precluded complete search; small search result set Both hard drives were somewhat small Single analysis/evaluator per case Reliance on open-source software Necessary s/w design severely biased CIP-1 time and affected overall process time hypothesis testing Not an all-in-one tool, as most are today 38

Contributions First academic work in improving IR effectiveness of DFTSS Theoretical extension of text mining research Context varied Extensibility wasn t guaranteed due to data set Practical implications Makes an important analytical approach useful Can reduce the incidence of missed evidence Lessens impact of organizational resource constraints 39

Future Research Empirically validate larger map size findings Replication needed Studies needed re: SOM parameter optimization Consideration of parameter-less SOMs Research into more appropriate stop-word lists Cluster navigation behavior studies needed Studies to better understand satisficing points in digital forensics Better tool to further test time hypotheses 40

Comments or Questions? Thank You! nicole.beebe@utsa.edu