Vertex Nomination: improved fusion of content and context. Glen A. Coppersmith, Human Language Technology Center of Excellence, Johns Hopkins University. Presented at Interface Symposium: May 17, 2012.

Our data: Enron Email Corpus. j.lovarato@enron, e.haedicke@enron, k.lay@enron, r.shapiro@enron, a.zipper@enron, b.rogers@enron.

Motivation and Problem Statement. We know the identities of a few fraudsters. We observe the content and the context of both fraudsters and non-fraudsters. We want to know the identities of more fraudsters. Inference Task: nominate persons likely to be fraudsters.

Outline: Introduction; Method (Importance Sampling, Evaluation); Analytics (Content and Context); Fusions; Conclusions & Future Directions.

Motivation and Problem Statement. We know the identities of a few fraudsters. [More generally: a graph attributed with human language, with interesting and non-interesting vertices.] We want to know the identities of more fraudsters. Inference Task: nominate persons likely to be fraudsters.

With loss of generality: Netflix [see Trevor's keynote]; Genomics [half the talks I've seen at IF12]; noted similarities: Recommender Systems. We focus on those with a graph and those with human language.

What's already been done? Theory: (Nam) Lee & Priebe 2011; (Dominic) Lee & Priebe (submitted 2012). Simulations and experiments: Marchette, Priebe, and Coppersmith (2011); Coppersmith & Priebe (submitted 2011). (Content + Context) > (Content or Context alone). The Human Language Technology and Graph Theory assumptions are valid for the Enron data. Importance Sampling provides sensible partitions.

Assumptions. The fraudsters talk to each other more than expected of a random pair of people. The fraudsters talk about different things than expected of a random pair of people.

Fusion of Disparate Information. Some signal from the communications graph; some signal from the human language content. How do you fuse them? A principled fusion would be nice; a useful fusion is more important: robust to real-world problems, scalable to real-world applications.

Questions for this talk. How should we fuse? (Both performance and scalability are important.) What kind of HLTs should we use? (Do they do different things?)

Outline: Introduction; Method (Importance Sampling, Evaluation); Analytics (Content and Context); Fusions; Conclusions & Future Directions.

Mathematical Model: κ graph. |V| = n, |M| = m, |M′| = m′, |V\M| = n − m. Edge-attribute probability vectors: p = [p_0, p_1] for ambient pairs and s = [s_0, s_1] for pairs within M, with p_0 = s_0 and p_1 < s_1. M are the fraudsters; V\M are the innocents. Model: κ(n, p, m, m′, s).

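A minimal sketch (mine, not from the talk) of drawing a graph from a κ model like this, under the assumption that q[0] and q[1] are the probabilities of a topic-0 ("boring") and a topic-1 ("interesting") edge, with q = s inside M and q = p elsewhere; the function name and parameter choices are illustrative:

```python
import random

def sample_kappa(n, m, p, s, seed=None):
    """Sample an edge-attributed graph from a kappa-style model.
    Vertices 0..m-1 form M (fraudsters); the rest are V\\M (innocents).
    p_0 = s_0 and p_1 < s_1 make M denser and more 'interesting'."""
    rng = random.Random(seed)
    edges = {}  # (u, v) -> edge topic
    for u in range(n):
        for v in range(u + 1, n):
            q = s if (u < m and v < m) else p  # within-M pairs use s
            r = rng.random()
            if r < q[0]:
                edges[(u, v)] = 0              # boring-topic edge
            elif r < q[0] + q[1]:
                edges[(u, v)] = 1              # interesting-topic edge
    return edges

# e.g. n = 189 vertices with |M| = 10, edge rates echoing the observed
# densities reported later in the talk (p_1 ~ 0.13, s_1 ~ 0.45):
g = sample_kappa(189, 10, p=[0.05, 0.13], s=[0.05, 0.45], seed=0)
```
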
Assumptions. The fraudsters talk to each other more than expected of a random pair of people: M is more dense than V\M. The fraudsters talk about different things than expected of a random pair of people: p = [p_0, p_1] and s = [s_0, s_1] with p_0 = s_0 and p_1 < s_1.

Our data: Enron Email Corpus. j.lovarato@enron, e.haedicke@enron, k.lay@enron, r.shapiro@enron, a.zipper@enron, b.rogers@enron. ("but where are the fraudsters, Glen!?")

Importance Sampling Procedure. Randomly partition Enron into M and V\M. Question the assumptions: Density(M) > Density(V\M); Topic Distribution(M) ≠ Topic Distribution(V\M). Discard partitions that violate the assumptions. Collect 5000 partitions.

Importance Sampling. j.lovarato@enron, e.haedicke@enron, k.lay@enron, r.shapiro@enron, a.zipper@enron, b.rogers@enron. Hand-labels provided by Michael Berry, 2004.

Importance Sampling M j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron V\M

Testing Partitions: Density. ρ(M) = (observed edges in M) / (possible edges in M). Δρ = ρ(M) − ρ(V\M). The fraudsters talk to each other more than expected of a random pair of people.

Testing Partitions: Topic Distribution. Topic(M) = [.2, .1, 0, 0, .25]; Topic(V\M) = [.1, .1, .3, .15, .1]; ΔTopic = [.1, 0, −.3, −.15, .15]. The fraudsters talk about different things than expected of a random pair of people.

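The procedure above is essentially rejection sampling. A sketch of how it might look, under assumed representations (edges as vertex pairs, documents as topic-weight vectors) and an assumed L1 acceptance threshold `min_dtopic` for the topic test; all names are hypothetical:

```python
import random

def density(vertex_set, edges):
    """rho(M): observed edges within the set / possible edges."""
    vs = set(vertex_set)
    observed = sum(1 for u, v in edges if u in vs and v in vs)
    possible = len(vs) * (len(vs) - 1) / 2
    return observed / possible if possible else 0.0

def topic_hist(vertex_set, topics_by_vertex, n_topics):
    """Average topic distribution over a vertex set's documents."""
    hist, total = [0.0] * n_topics, 0
    for v in vertex_set:
        for doc in topics_by_vertex.get(v, []):  # doc: topic-weight vector
            hist = [h + w for h, w in zip(hist, doc)]
            total += 1
    return [h / total for h in hist] if total else hist

def importance_sample(vertices, edges, topics_by_vertex, n_topics,
                      m=10, n_partitions=5000, min_dtopic=0.1, seed=0):
    """Randomly partition into M and V\\M; keep only partitions where
    M is denser than V\\M and their topic distributions differ."""
    rng, kept = random.Random(seed), []
    while len(kept) < n_partitions:
        M = set(rng.sample(vertices, m))
        rest = [v for v in vertices if v not in M]
        d_rho = density(M, edges) - density(rest, edges)
        d_topic = sum(abs(a - b) for a, b in zip(
            topic_hist(M, topics_by_vertex, n_topics),
            topic_hist(rest, topics_by_vertex, n_topics)))
        if d_rho > 0 and d_topic > min_dtopic:  # discard violators
            kept.append((M, rest))
    return kept
```
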
Method. |M| = 10, |M′| = 5, |V\M| = 179. For each vertex v in V\M we calculate each analytic. Rank vertices according to each analytic or fusion. Evaluate the quality of the ranked lists.

One experiment M j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron V\M

Evaluation. Ranked lists are evaluated by standard Information Retrieval measures. What is our inference task? Need to find all of them: Mean Average Precision (MAP). Need to find one more of them: Mean Reciprocal Rank (MRR). Can only examine k vertices: precision at k (p@k), k = {5, 10}.

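Hedged reference implementations of the three per-experiment measures (MAP and MRR are these quantities averaged over experiments):

```python
def average_precision(ranked, relevant):
    """AP: mean of precision-at-rank at each relevant item's rank."""
    rel, hits, score = set(relevant), 0, 0.0
    for i, v in enumerate(ranked, start=1):
        if v in rel:
            hits += 1
            score += hits / i
    return score / len(rel) if rel else 0.0

def reciprocal_rank(ranked, relevant):
    """RR: 1 / rank of the first relevant item ('find one more')."""
    rel = set(relevant)
    for i, v in enumerate(ranked, start=1):
        if v in rel:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, relevant, k):
    """p@k: fraction of the top k that are relevant."""
    rel = set(relevant)
    return sum(1 for v in ranked[:k] if v in rel) / k
```
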
Outline: Introduction; Method (Importance Sampling, Evaluation); Analytics (Content and Context); Fusions; Conclusions & Future Directions.

Context Analytic. The fraudsters talk to each other more than expected of a random pair of people. Analytic: the number of known fraudsters in the 1-hop neighborhood of the candidate vertex.

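A sketch of the context analytic as described, assuming edges are given as vertex pairs:

```python
def context_analytic(v, known_fraudsters, edges):
    """Number of known fraudsters in the 1-hop neighborhood of v."""
    known, neighbors = set(known_fraudsters), set()
    for a, b in edges:  # undirected edges as (u, v) pairs
        if a == v:
            neighbors.add(b)
        elif b == v:
            neighbors.add(a)
    return len(neighbors & known)
```
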
Content Analytics (HLTs). The fraudsters talk about different things than expected of a random pair of people. How similar is the content of each candidate vertex to the known fraudsters?

Training HLTs. M: e.haedicke@enron, j.lovarato@enron, k.lay@enron (Interesting). V\M: r.shapiro@enron, a.zipper@enron, b.rogers@enron (Uninteresting).

HLT_1: Average Word Count Histogram. What proportion of document d_i is made up of word w_j? Each d_i is represented as a probability vector x_i, with |x_i| = |W|, where W is the set of word types in the corpus. The vector I is the average of all interesting x_i.

HLT_1: Average Word Count Histogram. Score each d_i by 1 − JS(x_i, I). Example scores: 0.83 (high), 0.23 (low).

HLT_1: Average Word Count Histogram. Score all d_i for each vertex (e.g., k.lay@enron) and average the scores. Example: document scores 0.72, 0.11, 0.53, 0.42 average to a vertex score of 0.45.

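A sketch of HLT_1, assuming the corpus vocabulary and the "interesting" centroid I are precomputed; the Jensen-Shannon divergence here is base-2, so 1 − JS lies in [0, 1]. Taking the max rather than the mean gives HLT_2 (next slide):

```python
import math
from collections import Counter

def word_histogram(doc, vocab):
    """x_i: the proportion of document d_i made up of each word type."""
    counts = Counter(w for w in doc.lower().split() if w in vocab)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

def js_divergence(p, q):
    """Jensen-Shannon divergence, base 2, so it lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hlt1_score(vertex_docs, interesting_centroid, vocab):
    """HLT_1: average of 1 - JS(x_i, I) over a vertex's documents.
    Replacing the mean with max() gives HLT_2 (closest document)."""
    scores = [1 - js_divergence(word_histogram(d, vocab), interesting_centroid)
              for d in vertex_docs]
    return sum(scores) / len(scores) if scores else 0.0
```
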
HLT_2: Closest Word Count Histogram. Score each vertex only by its closest document. Example: document scores 0.72, 0.11, 0.53, 0.42 give a vertex score of 0.72.

HLT_3: Compression Language Modeling. How well does a given message compress? Repeated sequences compress well; novel sequences do not compress well.

HLT_3: Compression Language Modeling. Discriminative model: which model fits better? Example: 0.13 under the Interesting model vs. 0.38 under the Uninteresting model.

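The talk does not specify the compressor; as a crude stand-in, off-the-shelf zlib illustrates the idea, scoring a message by how many extra compressed bytes it costs on top of each training corpus (all names illustrative):

```python
import zlib

def compression_cost(message, corpus):
    """Extra compressed bytes per character that `message` costs
    beyond the corpus alone. Repeated sequences compress well;
    novel sequences do not."""
    base = len(zlib.compress(corpus.encode()))
    both = len(zlib.compress((corpus + " " + message).encode()))
    return (both - base) / max(len(message), 1)

def discriminative_score(message, interesting_corpus, uninteresting_corpus):
    """Which model fits better? Positive when the message compresses
    more cheaply against the interesting corpus."""
    return (compression_cost(message, uninteresting_corpus)
            - compression_cost(message, interesting_corpus))
```
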
HLT_3: Compression Language Modeling. M: e.haedicke@enron, j.lovarato@enron, k.lay@enron (Interesting). V\M: r.shapiro@enron, a.zipper@enron, b.rogers@enron (Uninteresting).

HLT_4: Topic Modeling. Latent Dirichlet Allocation (à la Blei, Wallach, McCallum, Mimno, ...) [we use mallet]. Each document is a mixture of topics; each topic is a probability distribution over W. mallet is available from http://mallet.cs.umass.edu/

HLT_4: Topic Modeling. M: j.lovarato@enron, e.haedicke@enron, k.lay@enron. V\M: r.shapiro@enron, a.zipper@enron, b.rogers@enron.

HLT_4: Topic Modeling. Score all d_i for each vertex (e.g., k.lay@enron): sum the weight given to topics of interest. Example vertex score: 0.12.

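The talk uses mallet; as an illustrative stand-in, a gensim sketch that fits LDA and scores each vertex by the average weight its documents give to the topics of interest (how those topics are chosen is left open here, and all names are hypothetical):

```python
from gensim import corpora, models

def hlt4_scores(docs_by_vertex, topics_of_interest, n_topics=50):
    """Fit LDA over all documents, then score each vertex by the
    average weight its documents give to topics of interest."""
    texts = [d.lower().split()
             for docs in docs_by_vertex.values() for d in docs]
    dictionary = corpora.Dictionary(texts)
    lda = models.LdaModel([dictionary.doc2bow(t) for t in texts],
                          num_topics=n_topics, id2word=dictionary)
    scores = {}
    for v, docs in docs_by_vertex.items():
        weights = []
        for d in docs:
            bow = dictionary.doc2bow(d.lower().split())
            weights.append(sum(w for t, w in lda.get_document_topics(bow)
                               if t in topics_of_interest))
        scores[v] = sum(weights) / len(weights) if weights else 0.0
    return scores
```
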
[Figure: Individual analytic performance. MAP (0.0-0.4) for Context, Average WCH, Minimum WCH, Compression, and Topic Modeling.]

Outline: Introduction; Method (Importance Sampling, Evaluation); Analytics (Content and Context); Fusions; Conclusions & Future Directions.

Linear Fusion (2 analytics). F = γ · HLT_1 + (1 − γ) · Context.

[Figure: Linear fusion of 2 analytics. Performance (MAP, up to ~0.25) as γ sweeps from 0 (pure Context) to 1 (pure Content); grid search over Content and Context.]

Linear Fusion. 2 analytics: F = γ · HLT_1 + (1 − γ) · Context. Arbitrary number of analytics: F = Σ_i (γ_i · HLT_i) + (1 − Σ_i γ_i) · Context. NB: scores must be calibrated.

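A sketch of calibrated linear fusion with an exhaustive grid search over the γ simplex; `evaluate` could be the `average_precision` sketch above. The grid is exponential in the number of analytics, which is the scalability problem discussed below:

```python
import itertools

def linear_fusion(analytics, gammas):
    """F = sum_i gamma_i * HLT_i + (1 - sum_i gamma_i) * Context.
    analytics = [hlt_1, ..., hlt_k, context], each a dict vertex -> score,
    calibrated to a common scale (e.g. [0, 1])."""
    *hlts, context = analytics
    rest = 1.0 - sum(gammas)
    return {v: sum(g * h[v] for g, h in zip(gammas, hlts)) + rest * context[v]
            for v in context}

def grid_search(analytics, relevant, evaluate, step=0.1):
    """Exhaustive search over the simplex of gamma weights:
    O(steps^k) in the number k of HLT analytics."""
    k = len(analytics) - 1
    ticks = [i * step for i in range(int(round(1 / step)) + 1)]
    best_gammas, best_quality = None, -1.0
    for gammas in itertools.product(ticks, repeat=k):
        if sum(gammas) > 1.0 + 1e-9:
            continue  # weights must stay on the simplex
        fused = linear_fusion(analytics, gammas)
        ranked = sorted(fused, key=fused.get, reverse=True)
        quality = evaluate(ranked, relevant)
        if quality > best_quality:
            best_gammas, best_quality = gammas, quality
    return best_gammas, best_quality
```
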
[Figure: Gridsearch linear combinations evaluated by MAP, MRR, p@5, and p@10 (MAP 0.0-0.5).]

Rank Fusion. Fuse ranks instead of scores: rank vertices by each analytic; each vertex is represented by a vector of ranks; the fusion score is a function of that vector. Min(): one measure can damn you. Max(): one measure can save you. Median(): something in between.

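A sketch of rank fusion as described, with min, max, or median passed as the combiner (here ranks are oriented so rank 1 is best, so min over ranks is the optimistic combiner and max the pessimistic one; the slide phrases this in score orientation):

```python
from statistics import median

def rank_fusion(analytics, combine=median):
    """Fuse ranks instead of scores. Rank vertices by each analytic
    (rank 1 = best); each vertex is then a vector of ranks and the
    fused score is combine(ranks). O(n) in the number of analytics,
    unlike the exponential grid search."""
    ranks = {}
    for scores in analytics:  # scores: dict vertex -> score, higher = better
        ordered = sorted(scores, key=scores.get, reverse=True)
        for r, v in enumerate(ordered, start=1):
            ranks.setdefault(v, []).append(r)
    fused = {v: combine(rs) for v, rs in ranks.items()}
    return sorted(fused, key=fused.get)  # best fused rank first

# e.g. rank_fusion([hlt1_scores, hlt2_scores, context_scores], combine=min)
```
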
[Figure: Rank fusion performance (MAP 0.0-0.5) for min, median, and max, with 3 analytics and with 5 analytics.]

[Figure: Overall comparisons (MAP 0.0-0.5) of individual analytics, rank fusions, and gridsearch linear combinations. Rank fusion is O(n) in the number of analytics (~11 minutes); gridsearch linear combination is O(x^n) (~13 days).]

Linear fusion: 3 analytics.

[Figure: Which HLTs should I use? Simplex plots over mixing weights for Context, Compression, and the other HLTs (Average WCH + Minimum WCH + Topic Modeling), shown for three inference tasks: need to find them all; need to find only one more; can only look at 5 or 10.]

Outline: Introduction; Method (Importance Sampling, Evaluation); Analytics (Content and Context); Fusions; Conclusions & Future Directions.

Questions. How should we fuse? For a small number of analytics: grid-search linear fusion. For a large number of analytics: rank fusion (or think harder). What kind of HLTs should we use? Depends on your inference task, of course. We can actually recommend what to use! Insight into the relative strengths of the HLTs is exposed.

Future and Related Work. This is post-hoc fusion of scores or ranks; what about more native fusion? Different data and inference tasks: What movie should I watch? (Netflix) Who should I cite? (ACL) Who should I work with? (GitHub) Vote prediction (Congressional Bills).

Collaborators. Carey Priebe [JHU HLTCOE], Allen Gorin [JHU HLTCOE], Richard Cox [JHU HLTCOE], David Marchette [Navy], Youngser Park [JHU CIS], Minh Tang [JHU AMS], Michael Decerbo [Raytheon BBN], Hanna Wallach [UMass Amherst], Jim Mayfield [JHU APL/HLTCOE], Paul McNamee [JHU APL/HLTCOE], William Szewczyk [DoD].

Thank you. Coppersmith & Priebe (submitted): [arxiv.org/1201.4118]. Latest papers (often) available from glencoppersmith.com.

Content of Importance Sample. k.lay@enron: California Power, Portland, San Diego, WBI. e.haedicke@enron: California Power, Portland, bcf, WBI Energy Services (billions of cubic feet). j.lovarato@enron: California Power, San Diego, WBI, bcf.

Content of Importance Sample. r.shapiro@enron. b.rogers@enron: Pro Football, Raiders, touchdown, implode. a.zipper@enron: 9/11, New York, tragedy.

Context of Importance Sample. Observed p_1 = 0.13; observed s_1 = 0.45; observed p_1 = 0.12.

Content of Importance Sample. k.lay@enron (0 + 12): Portland, San Diego, WBI. e.haedicke@enron (1 + 8): Portland, bcf, WBI Energy Services (billions of cubic feet). j.lovarato@enron (5 + 31): San Diego, WBI, bcf.

[Figure: Comparing HLT Methods. Scatter plot of MRR (0.0-1.0) against s_1 (0.0-0.4), one marker per HLT method.]

[Figure: Importance Sampled Joint distribution. ΔTopic (0.5-2.0) vs. Δρ (0.0-0.5).]

[Figure: Injection Importance. Panels plotting P_0 (0.00-0.06) vs. P_1, and S_0 (0.0-0.6) vs. S_1.]