The Security Role for Content Analysis

The Security Role for Content Analysis
Jim Nisbet, Founder, Tablus, Inc.
November 17, 2004

About Us
- Tablus is a three-year-old company that delivers solutions providing visibility into sensitive information leaving the network
- Our background:
  - Low-level packet capture and protocol analysis
  - Content analysis, content classification
  - Database management

What is Content Analysis?
- Techniques for extracting and classifying data
- Simplistic (deterministic)
  - Matching specific keywords or patterns
  - Matching metadata in specific file formats
- Algorithmic
  - Statistical analysis
  - Language-based analysis
  - Document clustering

What Does Content Analysis Have to Do with Security?
- Companies are increasingly focused on protecting their sensitive information
  - Customer lists, SSNs (PII, PHI)
  - Trade secrets, internal documents
  - Proprietary information: source code, etc.
- The challenge is to be able to accurately classify data as sensitive

Detecting Sensitive Information
- Verify that copies of sensitive information are adequately protected
- Remove unneeded copies of sensitive information
- Monitor the network for the transmission of sensitive information

Real-World Ways Mail Filters Identify Information as Sensitive
(ordered from most accurate, through least accurate, to embarrassing)
- The rights metadata tells us that it is (DRM/XrML)
- String/pattern matching: "confidential", "do not distribute", \d{3}\-\d{2}\-\d{4}
- Sensitive destination (domain, email ID, etc.)
- Specific file extensions (C, CPP, XLS)
- Sensitive if file size exceeds a specified size
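The string/pattern matching approach can be sketched in a few lines of Python. This is only an illustration: the keyword list and the SSN-style pattern come from the slide, while the function name and the added word boundaries (to avoid matching inside longer digit runs) are assumptions.

```python
import re

# Pattern from the slide: 3 digits, 2 digits, 4 digits, separated by hyphens.
# The \b word boundaries are an added assumption, not part of the slide.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Keyword phrases quoted on the slide.
KEYWORDS = ("confidential", "do not distribute")


def find_sensitive_markers(text):
    """Return keyword phrases and SSN-like strings found in text (hypothetical helper)."""
    lowered = text.lower()
    hits = [kw for kw in KEYWORDS if kw in lowered]
    hits.extend(SSN_PATTERN.findall(text))
    return hits


# prints ['confidential', '123-45-6789']
print(find_sensitive_markers("CONFIDENTIAL: employee SSN is 123-45-6789"))
```

A match here only says that a marker is present; it says nothing about whether the surrounding content is actually sensitive, which is why this technique sits in the middle of the accuracy scale.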

Types of Machine Classification Techniques
- Linguistic: based on the text or language
- Semantic analysis: based on the meaning of the words
- Bayesian inference: based on probability theory
- Statistical/AI: neural net
- Hybrids
- Supporting inputs: lexical resources (e.g. WordNet), training set

Taxonomic Classification
- Taxonomic classification can tell you what subjects are likely discussed in a document
- Would allow us to say that documents that talk about concurrent routing are sensitive
[Figure: taxonomy tree running from general to specific, with nodes including Networking, Computing, ATM, Routing, Switching, Condos, Large, and Concurrent]

Document Clustering
- Documents with similar characteristics are grouped together
  - Based on language
  - Based on statistical information
  - Based on metadata
[Figure: Similar Documents]

Derivative Work Analysis
- Derivative work analysis means looking for overwhelming similarity in lexicon or form
[Figure: Document Versions]

Vector Space Analysis
- Document represented by a vector of terms
  - Words (or word stems)
  - Phrases (e.g. "computer science")
- Removes words on stop list
  - Documents aren't about "the"
- Often assumed that terms are uncorrelated
- Correlations between term vectors imply a similarity between documents

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
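A minimal sketch of the bag-of-words vector idea, assuming whitespace tokenization, a tiny illustrative stop list, and cosine similarity as the measure of correlation between vectors (the slides do not name a specific measure). Storing each vector as a dictionary of term counts is one way to handle the sparsity mentioned above.

```python
import math
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def to_vector(text):
    """Bag of words as a sparse vector: term -> count, with stop words removed."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


doc_a = to_vector("concurrent routing in the network switch")
doc_b = to_vector("the network switch and concurrent routing design")
doc_c = to_vector("lunch menu")
print(cosine_similarity(doc_a, doc_b))  # high: most terms are shared
print(cosine_similarity(doc_a, doc_c))  # 0.0: no terms in common
```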

The Training Set: Processing Existing Sensitive Content
- We can gain accuracy if we can provide
  - A representative set of sensitive content
  - A negative training set ("outliers"): similar content that is not sensitive
- Learning systems
  - Supervised vs. unsupervised learning
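To make the role of the positive and negative training sets concrete, here is a small supervised learner. The slides do not say which learner is used; this sketch assumes a multinomial naive Bayes classifier with add-one smoothing, and the class name, labels, and training sentences are all made up for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayes()
# Representative sensitive content (made-up examples).
clf.train("quarterly revenue forecast draft", "sensitive")
clf.train("merger plan revenue targets", "sensitive")
# Negative training set: content that is not sensitive.
clf.train("lunch menu for friday", "other")
clf.train("team picnic friday plan", "other")

print(clf.classify("revenue forecast"))  # prints sensitive
```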

Spam vs. Sensitive Content
- Spam detection and sensitive content detection have significant differences.

Detecting Sensitive Information vs. Spam Detection
Similarities
- Language analysis can be used to classify
- Metadata may exist to help with detection
Differences
- Your spam and my spam are the same; your sensitive information and mine are different
- Sensitive information is often time sensitive; spam is not
- Spammers try to make spam not look like spam; sensitive information is not obfuscated

HOW DO YOU KNOW WHAT IS SENSITIVE? ASSET CLASSIFICATION

ISO 17799: Code of Practice for Information Security Management
- Security Policy
- System Access Control
- Computer & Operations Management
- System Development and Maintenance
- Physical and Environmental Security
- Compliance
- Personnel Security
- Asset Classification and Control
- Business Continuity Management

ISO 17799: Asset Classification
A suggested schema for classification:
- Confidentiality: can the information be freely distributed?
- Value: is this costly to replace?
- Time: will its confidentiality status change over time?
- Access rights: who will have access to this information?
- Retention: when can this information be destroyed?

ISO 17799: Confidentiality
Suggested levels of confidentiality*:
- Authentication data: internal security information
- Restricted: never leaves organization
- Confidential Plus: financial information
- Confidential: restricted to specific roles within the company
- Internal: restricted to company employees
- Public: publicly accessible information
* Actual classification schema used by a very large financial institution

INFORMATION RETRIEVAL (IR): IT'S ALL ABOUT THE ACCURACY

Quick Overview of Search and IR Technologies
In IR solutions, when we talk about accuracy we often talk about:
- Precision: proportion of retrieved material that is actually relevant
- Recall: proportion of relevant material that is actually retrieved
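The two definitions translate directly into set arithmetic. A minimal sketch, with a hypothetical function name and made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved items that are actually relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 documents relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5"})
print(p)  # prints 0.5
print(r)  # two of the three relevant documents were retrieved
```

For content monitoring, low precision shows up as false positives and low recall as sensitive content that slips through unflagged.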

Is Internet Search Really About Search?
- Finding all occurrences is not too helpful
- Ranking the relevance of the results is key
- If you lower the recall you will generally increase the precision, but you run the risk of missing something you want

Precision/Recall Curves
- There is a tradeoff between precision and recall
- So measure precision at different levels of recall
[Figure: precision plotted against recall]
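One common way to obtain the points for such a curve is to walk down a ranked result list and record precision each time another relevant document is found. The sketch below assumes that approach; the function name and the document IDs are illustrative.

```python
def precision_at_recall_points(ranked, relevant):
    """Return (recall, precision) pairs, one per relevant document found in rank order."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Recall so far, and precision over the top `rank` results.
            points.append((hits / len(relevant), hits / rank))
    return points


ranked_results = ["d1", "d7", "d3", "d9", "d5"]
for recall, precision in precision_at_recall_points(ranked_results, {"d1", "d3", "d5"}):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this made-up ranking, precision falls from 1.00 to 0.60 as recall rises from 0.33 to 1.00, which is the tradeoff the curve illustrates.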

TREC: NIST Text Retrieval Competition
- Text REtrieval Conference/Competition
  - Run by NIST (National Institute of Standards & Technology)
- Collection: 3 gigabytes, >1 million docs
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "information specialists"
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall

An Example of Poor Precision

Precision and Ambiguity
- For search, people often automatically disambiguate the results without ever realizing that the search engine could not do this for them.
- If the answer you are looking for is in the first five results returned, it may not matter that we cannot correctly rank the closest match as #1.

Did anyone here wonder what UniCycle for SmallTalk is? YES NO

UniCycle for SmallTalk (so now you know!) YES NO

For High Precision, Nothing Beats Specificity
- A single five-character query, "v234r", finds the same single document on both Google and Yahoo.

Keyword Matching
- We can achieve high precision when we provide very specific keywords to match
- Matching generic terms is not likely to produce sufficient precision for flagging sensitive content
- False positives = insufficient precision

Document Clustering Techniques
- If you can process a representative set of your sensitive content, then you can use document clustering techniques
- Examples
  - HR documents
  - Source code for specific projects
  - Corporate financial documents
  - Marketing documents

Document Clustering: "Documents like this"
- We can scan new content we see for overwhelming similarity to existing sensitive documents
- Documents can be viewed as similar based on statistical similarities, or you can look for language similarities (or both)

Text Analysis: Formatted Rich Text

Text Analysis: Remove formatting, convert to unicode

Text Analysis: Tokenize, stem, normalize
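The normalize, tokenize, and stem steps named on the last few slides can be sketched as a small pipeline. This is a deliberately crude illustration: the suffix-stripping rule stands in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Fold Unicode compatibility forms and lowercase the text."""
    return unicodedata.normalize("NFKC", text).lower()


def tokenize(text):
    """Split normalized text into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text)


def crude_stem(token):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full pipeline: normalize, tokenize, then stem each token."""
    return [crude_stem(t) for t in tokenize(normalize(text))]


# prints ['auditor', 'collect', 'finding']
print(analyze("Auditors collected findings"))
```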

Document Representation: What Values to Use for Terms
- tf (term frequency): count of times the term occurs in the document
  - The more times a term t occurs in document d, the more likely it is that t is relevant to the document
  - Used alone, favors common words and long documents
- df (document frequency)
  - The more a term t occurs throughout all documents, the more poorly t discriminates between documents
- tf-idf (term frequency * inverse document frequency)
  - A high value indicates that the word occurs more often in this document than average
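A minimal sketch of tf-idf weighting, assuming whitespace tokenization and the common idf form log(N / df), where N is the number of documents. The slide does not specify which variant is used, and the function name and sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # df: number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


weights = tf_idf(["audit report audit", "audit findings", "lunch menu"])
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

In the first sample document, "audit" occurs twice but also appears in a second document, so it ends up with a lower weight than "report", which occurs once and only in that document. A term that appears in every document gets a weight of zero.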

Document Matching Based on Fingerprints
- Documents that are derived from the same source will likely have common fingerprints
- Documents that share common boilerplate information (shared by many documents) will not
- Documents talking about the same thing but not derived from the same source will probably not share common fingerprints

Text Analysis: Generate Collection of Fingerprints
Fingerprint 1 @1: 9aee9ea8d640be7f6df0766b02411e93
  findings audit follows acceptable satisfies requirements ISO 9001 2000
Fingerprint 2 @21: 4c7c05b25170b410398bffa26c7faa47
  audit report version 2001 standard objective information concerning
Fingerprint 3 @35: 8bfc73fa7a07aa92f091fce0b2272087
  82203 Auditing 2001 During audit auditor collect objective information...
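The slides do not say how the fingerprints are computed. The 32-hex-digit values shown resemble 128-bit digests, so the sketch below assumes an MD5 hash over sliding windows of lowercased tokens. The window size and function names are hypothetical, and no attempt is made to reproduce the exact values above.

```python
import hashlib


def fingerprints(text, window=8):
    """Hash every run of `window` consecutive tokens; return the set of hex digests."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    prints = set()
    # A text shorter than the window still yields one fingerprint.
    for i in range(max(len(tokens) - window + 1, 1)):
        chunk = " ".join(tokens[i:i + window])
        prints.add(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return prints


def shared_fraction(candidate, source, window=8):
    """Fraction of the candidate's fingerprints that also occur in the source."""
    cand = fingerprints(candidate, window)
    src = fingerprints(source, window)
    return len(cand & src) / len(cand) if cand else 0.0


source = "during the audit the auditor will collect objective information concerning the findings"
derived = "note that " + source
unrelated = "lunch menu for friday includes soup and salad"

print(shared_fraction(derived, source, window=4))    # high: derived from the source
print(shared_fraction(unrelated, source, window=4))  # 0.0: nothing in common
```

Because each fingerprint covers a run of consecutive words, a document that merely discusses the same topic in different wording shares few or none, which matches the behaviour described on the previous slide.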

Content Alarm Approach
[Figure: network diagram showing the firewall, router, internal network, Content Alarm, and Content Alarm Administrator. Content Alarm components shown: Auditor, Policy Manager, Content Analysis Engine (CAE), SMTP, FTP, HTTP, Dynamic Protocol Determination, TCP session re-assembly, packet capture engine]

Content Alarm Approach: Content Analysis Engine
- Multiple analyzers
  - Keyword/regex analyzer
  - File signature analyzer
  - Metadata analyzer
  - Fingerprint analyzer
  - Database data analyzer*
[Figure: Content Alarm components: Auditor, Policy Manager, Content Analysis Engine (CAE), SMTP, FTP, HTTP, Dynamic Protocol Determination, TCP session re-assembly, packet capture engine]
* Version 2 feature

Content Alarm Architecture
- File Crawler: runs as a service monitoring sensitive content and building fingerprint information for Content Alarm
  - Processes both public and protected files; file ACLs can determine sensitivity
  - Monitors changed files and then provides Content Alarm(s) with new signature and fingerprint information
[Figure: firewall, router, internal network, Content Alarm, Content Alarm Administrator]

Conclusion Think of search and knowledge discovery tools as part of your security toolkit!

Contact Information
Tablus, Inc.
155 Bovet Road, Suite 610
San Mateo, CA 94402
www.tablus.com
jim.nisbet@tablus.com