Text Classification in Electronic Discovery. Disclaimer

Similar documents
In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

Viewpoint Review & Analytics

CSCI 5417 Information Retrieval Systems. Jim Martin!

Computer Aided Draughting and Design: Graded Unit 1

THE INTERNATIONAL INSTITUTE OF CERTIFIED FORENSIC ACCOUNTANTS, INC. USA. CERTIFIED IN FRAUD & FORENSIC ACCOUNTING (Cr.

ZEROING IN DATA TARGETING IN EDISCOVERY TO REDUCE VOLUMES AND COSTS

Higher National group award Graded Unit Specification

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

What s Old Is New A Peek At Concordance Andy Kass

RINGTAIL A COMPETITIVE ADVANTAGE FOR LAW FIRMS. Award-winning visual e-discovery software delivers faster insights and superior case strategies.

These are notes for the third lecture; if statements and loops.

Victor Stanley and the Changing Face

Higher National Unit specification: general information. Graded Unit title: Computer Science: Graded Unit 2

THIS LECTURE. How do we know if our results are any good? Results summaries: Evaluating a search engine. Making our good results usable to a user

Shielding the Organization from Data Risk & E- Discovery Failures

Data-driven design decisions for discovery interfaces

Higher National Unit specification: general information. Graded Unit title: Computing: Networking: Graded Unit 2

Supervised Learning: The Setup. Spring 2018

Graded Unit title: Computing: Networking: Graded Unit 2

Tree-based methods for classification and regression

Outline for Today. Euler Tour Trees Revisited. The Key Idea. Dynamic Graphs. Implementation Details. Dynamic connectivity in forests.

SEPTEMBER 2018

Software Engineering

ProjectPlus. Implementations, Upgrades WHITE PAPER. Table of Contents SCOPE OF THIS PAPER... 2 PROJECTPLUS... 3 MENU SET-UP AND MANAGEMENT...

Semantics, Metadata and Identifying Master Data

EXAM PREPARATION GUIDE

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Higher National Unit specification: general information. Graded Unit 2

Federal Rules of Civil Procedure IT Obligations For

2013 Solo and Small Firm Conference

London User Group August 18, 2016

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

The Bizarre Truth! Automating the Automation. Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER

Requirements Gathering

Gene Clustering & Classification

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

Information Retrieval. Lecture 7

The Security Role for Content Analysis

Course 832 EC-Council Computer Hacking Forensic Investigator (CHFI)

Web Information Retrieval. Exercises Evaluation in information retrieval

Databases in Discovery

Improved Database Development using SQL Compare

CATCH ERRORS BEFORE THEY HAPPEN. Lessons for a mature data governance practice

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

RSA INCIDENT RESPONSE SERVICES

Introduction to Information Security Prof. V. Kamakoti Department of Computer Science and Engineering Indian Institute of Technology, Madras

How To Make 3-50 Times The Profits From Your Traffic

ISO27001:2013 The New Standard Revised Edition

Information Retrieval

Integration of New and Big Administrative Data Sources into Turkish Statistical System

The Development of Information Systems

Multiple Variable Drag and Drop Demonstration (v1) Steve Gannon, Principal Consultant GanTek Multimedia

6. Dicretization methods 6.1 The purpose of discretization

MIS Week 9 Host Hardening

EXAM PREPARATION GUIDE

ADDRESSING TODAY S VULNERABILITIES

EXAM Administration of Symantec ediscovery Platform 8.0 for Users. Buy Full Product.

2.1 Ethics in an Information Society

Massive Data Analysis

Instructor: Craig Duckett. Lecture 03: Tuesday, April 3, 2018 SQL Sorting, Aggregates and Joining Tables

Information Retrieval

Exemplar Candidate Work Part 1 of 2 GCE in Applied ICT. OCR Advanced GCE in Applied ICT: H715 Unit G057: Database design

Reduce Your Network's Attack Surface

Atrium Webinar- What's new in ADDM Version 10

Unlock SAS Code Automation with the Power of Macros

The Specification Phase

A Class of Instantaneously Trained Neural Networks

Los Angeles User Group

Multi-label classification using rule-based classifier systems

Assured Compliance through Information Security Continuous Monitoring

Introduction to Machine Learning Prof. Mr. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

AAPA. Legal Issues and Record Retention. SML, Inc. Steve M. Lewis, President and CEO

12 reasons to use us over other search providers:

Triaging Evidence. Statistical Sampling in DFIR. Ray Strubinger DFIR Managing Consultant

A PRACTICAL GUIDE TO DISCOVERY IN MODERN PLACES: THE WEARABLES. By: Jamie Huffman Jones, partner Friday, Eldredge, & Clark, LLP

Note: In the presentation I should have said "baby registry" instead of "bridal registry," see

MAPS PhD Expert. Latest Knowledge Is Your Greatest Asset

Database and Knowledge-Base Systems: Data Mining. Martin Ester

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

RSA INCIDENT RESPONSE SERVICES

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

Introduction to Information Retrieval

Data Virtualization Implementation Methodology and Best Practices

My name is Joe Bhatia, and I am president and CEO of the American National Standards Institute.

Know Your Network. Computer Security Incident Response Center. Tracking Compliance Identifying and Mitigating Threats

Skills Academy. Forensic Studies Courses

Mobility best practice. Tiered Access at Google

NEN The Education Network

Visualization and text mining of patent and non-patent data

Telecom Italia response. to the BEREC public consultation on

Lecture (08, 09) Routing in Switched Networks

Data Mining. Lecture 03: Nearest Neighbor Learning

(Refer Slide Time 3:31)

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

It Might Be Valid, But It's Still Wrong Paul Maskens and Andy Kramek

6/11/2012 THE UX REVOLUTION AT SUCCESSFACTORS. Philip Haine Vice President of User Experience

Background Check User Guide

Using the Web in Your Teaching

Transcription:

Text in Electronic Discovery Dave Lewis, Ph.D. David D. Lewis Consulting, LLC www.daviddlewis.com Slides for lecture via Skype on February 23, 2012 in LBSC 708/INFM 718 (Seminar on E-Discovery, Univ. of Maryland). 1 Disclaimer I am not a lawyer Nothing here should be considered legal advice If you need legal advice, consult a lawyer Repeat as needed 2 1

Disclaimer 2 I have a variety of interests in this area Co-founder of TREC Legal Track Publishing and public speaking in attempt to influence evolving legal standards Consulting expert and expert witness on legal cases involving e-discovery Algorithm designer for an e-discovery vendor 3 Outline What is e-discovery? IR technologies in e-discovery Text retrieval Text classification Effectiveness evaluation Summary 4 2

E-Discovery in Black and White A document is Preserved or not Reviewed or not Listed in privilege log or not Turned over as responsive or not Presented in court or not As evidence of a particular fact or not Law tries to define actions unambiguously That means classification 5 Text Deciding which of several predefined classes (groups) a text belongs to Crudest form of understanding text BUT current technology often can do well AND many tasks can be viewed as classification Particularly tasks of organizing information 6 3

Advantages of Viewing a Task As Classifiers are good plug-in modules Text in/class out" a simple interface Finite set of outputs a good basis for Conditioning action on output Counting, correlation, etc. Automated applying and learning of classifiers Software reusable across classification tasks Clarifies nature of task, successes, failure modes Straightforward numerical measures of quality 7 Organizing Sets of Classes Binary classification Three or more classes Multiclass Multilabel Ordinal Hierarchical Probabilistic and graded classification 8 4

classification in e-discovery MANAGE- MENT CULLING REVIEW PRESENTATION Other parties 9 which documents to keep? MANAGE- MENT 10 5

which documents to review? CULLING 11 which documents must we turn over? REVIEW 12 6

what types of documents do we have? REVIEW 13 which documents do we show in court? PRESENTATION 14 7

2/23/12 Arkfeld, Michael. Arkfeld on Electronic Discovery and Evidence, 2nd edition, Feb. 2010. 15 Approaches to Interactive manual classification Search and save documents Interactive manual classifier creation Search and save query Supervised learning of classifier Review documents Computer creates query 16 8

Search and Save "System" 17 Search and Save "System" 18 9

Search and Save Strengths: See interesting docs first Leverage search algorithms and interfaces Weaknesses Personnel must have both case knowledge and search skills Documents arrive in bursty fashion Constantly need these highly trained people Human variability Can't show later why doc was or wasn't saved 19 Query = Classifier Search and Save Query (Manual Classifier Construction) System 20 10

Classifier System 21 Manual Classifier Construction in Culling 1-2 orders of magnitude reduction typically quoted Traditionally simple Booleans Developed iteratively Sometimes statistical evaluation of effectiveness Things are moving very rapidly in this area 22 11

Classifier System 23 Classifier System 24 12

Classifier System 25 Research Questions in Manual Classifier Construction for E-Disco Culling: Optimizing large-grained selection Detecting redundancy Summary representations How good are people at this? How can we help them? Much from distributed search applies! 26 13

Manual Classifier Construction Strengths Classifier can be run as new docs arrive Explicit justification of decisions Weaknesses Must justify why classifier built way it was New docs may differ from old ones, requiring ongoing modification of classifier Haven't escaped need for single search/case expert 27 28 14

29 Supervised Learning 30 15

Supervised Learning for Text People manually classify documents Computer searches for a rule that Agrees with most manual classifications Is not too "complex" Uses that rule to classify future documents Rules typically associate numeric weights with words and other document features 31 REVIEW 32 16

REVIEW 33 REVIEW $ $ $ $ $ $ $ $ 34 17

REVIEW? 35 REVIEW? responsiveness? privilege? 36 18

REVIEW? 37 Other Favorable Factors for Supervised Learning Review is a recall-oriented task Classifier likely to need thousands of words Very hard to create by hand Many, weak, disparate predictors Text, custodian, time, place, organizational structure, file type, message headers,... Again very hard to manually use Objective reason the classifier is way it is "The learning algorithm said so" 38 19

?? System Classifier 39?? System Classifier 40 20

0.98 0.93 0.89 System Classifier 41 Unusual Aspects of E-Discovery for Supervised Learning 1 Ultra high recall, moderate precision 2 Extremely diverse document sets 3 Duplicate documents 4 Document families 5 Multiple assessors 6 Some categories conditional on others 7 Goal of assessment is review, not training --May be unwilling to evaluate documents for categories if not responsive All of these are pains / research opportunities 42 21

8. active learning? 0.50? 0.50? 0.50 System Classifier 43 9. diversity-biased classification A B C 0.98 0.80 0.76 System Classifier 44 22

10. transduction 0.98 0.93 0.89 System Classifier 45 Outline What is e-discovery? IR technologies in e-discovery Text retrieval Text classification Effectiveness evaluation Why I love e-discovery 46 23

The Joy of E-Discovery All that SIGIR/TREC/etc. IR geek stuff... Supervised learning from huge amounts of training data High recall search Careful statistical measurement of effectiveness Obsessed tuning to get effectiveness up Matters, works, and saves people great drudgery and cost 47 Summary E-discovery is an application where IR really matters Current technology can help a lot Advances would help even more Rigor matters: write, program as if you might have to testify about it 48 24

Thanks! I'm always happy to answer questions: Sign up if you'd like to be on list to receive updated copies of this talk 49 25