Vertex Nomination: improved fusion of content and context
Human Language Content / Communications Graph
Glen A. Coppersmith
Human Language Technology Center of Excellence, Johns Hopkins University
Presented at Interface Symposium: May 17, 2012
Our data: Enron Email Corpus j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron
Motivation and Problem Statement
We know the identities of a few fraudsters. We observe the content and the context of both fraudsters and non-fraudsters. We want to know the identities of more fraudsters.
Inference Task: Nominate persons likely to be fraudsters.
Outline: Introduction · Method · Importance Sampling · Evaluation · Analytics · Content and Context Fusions · Conclusions & Future Directions
Motivation and Problem Statement (generalized)
We know the identities of a few fraudsters. [More generally: a graph attributed with human language, with interesting and non-interesting vertices.] We want to know the identities of more fraudsters.
Inference Task: Nominate persons likely to be fraudsters.
With loss of generality
Noted similarities: Netflix [See Trevor's Keynote], Genomics [Half the talks I've seen at IF12], Recommender Systems
We focus on problems with a graph and with human language.
What's already been done?
Theory: (Nam) Lee & Priebe 2011; (Dominic) Lee & Priebe (submitted 2012)
Simulations and Experiments: Marchette, Priebe, and Coppersmith (2011); Coppersmith & Priebe (submitted 2011)
(Content + Context) > (Content) or (Context) alone
The Human Language Technology and Graph Theory assumptions are valid for the Enron data.
Importance Sampling provides sensible partitions.
Assumptions
The fraudsters talk to each other more than expected of a random pair of people.
The fraudsters talk about different things than expected of a random pair of people.
Fusion of Disparate Information
Some signal from the communications graph; some signal from the human language content. How do you fuse them?
A principled fusion would be nice; a useful fusion is more important: robust to real-world problems, scalable to real-world applications.
Questions for this talk
How should we fuse? (Both performance and scalability are important.)
What kind of HLTs should we use? (Do they do different things?)
Outline: Introduction · Method · Importance Sampling · Evaluation · Analytics · Content and Context Fusions · Conclusions & Future Directions
Mathematical Model: κ graph
|V| = n, |M| = m, |M′| = m′, |V\M| = n − m
p = [p0, p1], s = [s0, s1], with p0 = s0 and p1 < s1
κ(n, p, m, m′, s)
M: fraudsters (parameters s); V\M: innocents (parameters p)
Assumptions
The fraudsters talk to each other more than expected of a random pair of people: M is more dense than V\M.
The fraudsters talk about different things than expected of a random pair of people: p = [p0, p1] and s = [s0, s1], with p0 = s0 and p1 < s1.
Our data: Enron Email Corpus j.lovarato@enron e.haedicke@enron but where are the fraudsters, Glen!? r.shapiro@enron k.lay@enron a.zipper@enron b.rogers@enron
Importance Sampling Procedure
Randomly partition Enron into M and V\M.
Question the assumptions: is Density(M) > Density(V\M)? Is Topic Distribution(M) ≠ Topic Distribution(V\M)?
Discard partitions that violate the assumptions. Collect 5000 partitions.
Importance Sampling
j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron
Hand-labels provided by Michael Berry, 2004
Importance Sampling M j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron V\M
Testing Partitions: Density
ρ(M) = (observed edges in M) / (possible edges in M)
Δρ = ρ(M) − ρ(V\M)
The fraudsters talk to each other more than expected of a random pair of people.
Testing Partitions: Topic Distribution
Topic(M) = [.2, .1, 0, 0, .25]
Topic(V\M) = [.1, .1, .3, .15, .1]
ΔTopic = [.1, 0, −.3, −.15, .15]
The fraudsters talk about different things than expected of a random pair of people.
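The two partition tests can be sketched in a few lines (a minimal sketch; the helper names and the undirected edge-list representation are our assumptions, not the talk's code):

```python
def density(edges, members):
    # rho(M): observed edges within `members` / possible edges within `members`.
    members = set(members)
    observed = sum(1 for u, v in edges if u in members and v in members)
    possible = len(members) * (len(members) - 1) // 2
    return observed / possible if possible else 0.0

def delta_topic(topic_m, topic_rest):
    # Elementwise Topic(M) - Topic(V\M).
    return [a - b for a, b in zip(topic_m, topic_rest)]

# A random partition is kept only if it satisfies both assumptions, e.g.:
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
delta_rho = density(edges, ["a", "b", "c"]) - density(edges, ["d", "e"])
d_topic = delta_topic([.2, .1, 0, 0, .25], [.1, .1, .3, .15, .1])
```

Partitions with small Δρ or small topic difference are the ones the procedure discards.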
Method
|M| = 10, |M′| = 5, |V\M| = 179
For each vertex v in V\M we calculate each analytic.
Rank the vertices according to each analytic or fusion.
Evaluate the quality of the ranked lists.
One experiment M j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron V\M
Evaluation
Ranked lists are evaluated by standard Information Retrieval measures. What is our inference task?
Need to find all of them: Mean Average Precision (MAP)
Need to find one more of them: Mean Reciprocal Rank (MRR)
Can only examine k vertices: precision at k (p@k), k = {5, 10}
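The three retrieval measures can be sketched as follows (a minimal sketch; the function names are ours):

```python
def average_precision(ranked, relevant):
    # Mean of precision at each rank where a relevant item appears.
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant item found.
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / k
    return 0.0

def precision_at_k(ranked, relevant, k):
    # Fraction of the top k that is relevant.
    return sum(1 for item in ranked[:k] if item in relevant) / k
```

MAP and MRR are these per-list quantities averaged over the sampled partitions.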
Outline: Introduction · Method · Importance Sampling · Evaluation · Analytics · Content and Context Fusions · Conclusions & Future Directions
Context Analytic
The fraudsters talk to each other more than expected of a random pair of people.
Score: the number of known fraudsters in the 1-hop neighborhood of the candidate vertex.
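As a sketch (assuming an undirected edge list; the helper name is ours):

```python
def context_score(vertex, edges, known_fraudsters):
    # Count known fraudsters in the 1-hop neighborhood of `vertex`.
    neighbors = set()
    for u, v in edges:
        if u == vertex:
            neighbors.add(v)
        elif v == vertex:
            neighbors.add(u)
    return len(neighbors & known_fraudsters)
```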
Content Analytics (HLTs)
The fraudsters talk about different things than expected of a random pair of people.
How similar is the content of each candidate vertex to that of the known fraudsters?
Training HLTs
M (Interesting): j.lovarato@enron, e.haedicke@enron, k.lay@enron
V\M (Uninteresting): r.shapiro@enron, a.zipper@enron, b.rogers@enron
HLT1: Average Word Count Histogram
What proportion of the document d_i is made up of word w_j?
Each d_i is represented as a probability vector x_i, with |x| = |W|, where W is the set of word types in the corpus.
The vector I is the average of all interesting x_i.
HLT1: Average Word Count Histogram
Score each d_i by 1 − JS(x_i, I).
Example scores: 0.83 (High), 0.23 (Low).
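A sketch of the histogram and scoring steps (the whitespace tokenization and the base-2 Jensen-Shannon divergence are our assumptions; the talk does not specify either):

```python
from collections import Counter
from math import log2

def wch(tokens):
    # Word-count histogram: the proportion of the document made up of each word.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2, so it lies in [0, 1]).
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * log2(a[w] / m[w]) for w in vocab if a.get(w, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def score(doc_tokens, interesting_mean):
    # Score a document by 1 - JS(x_i, I).
    return 1 - js_divergence(wch(doc_tokens), interesting_mean)
```

A document identical in distribution to I scores 1; a document sharing no words with I scores 0.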
HLT1: Average Word Count Histogram
Score all d_i for each vertex (e.g. k.lay@enron), then average the scores:
0.72, 0.11, 0.53, 0.42 → average 0.45
HLT2: Closest Word Count Histogram
Score all d_i for each vertex (e.g. k.lay@enron), but score only by the closest document:
0.72, 0.11, 0.53, 0.42 → 0.72
HLT3: Compression Language Modeling
How well does a given message compress?
Repeated sequences compress well; novel sequences do not compress well.
HLT3: Compression Language Modeling
Discriminative model: which model fits better?
Example: Interesting 0.13, Uninteresting 0.38.
HLT3: Compression Language Modeling
Train one model on the Interesting side (j.lovarato@enron, e.haedicke@enron, k.lay@enron) and one on the Uninteresting side (r.shapiro@enron, a.zipper@enron, b.rogers@enron).
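A minimal stand-in for the discriminative compression model, using zlib in place of whatever compressor the talk actually used (an assumption; the function names are ours):

```python
import zlib

def bytes_saved(message, corpus):
    # How many compressed bytes the corpus saves us on `message`:
    # sequences the corpus already contains compress well, novel ones do not.
    alone = len(zlib.compress(message.encode()))
    baseline = len(zlib.compress(corpus.encode()))
    together = len(zlib.compress((corpus + message).encode()))
    return alone - (together - baseline)

def discriminate(message, interesting, uninteresting):
    # Positive if the Interesting model fits the message better.
    return bytes_saved(message, interesting) - bytes_saved(message, uninteresting)
```

A message about, say, California power should compress far better against the Interesting corpus than against football chatter.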
HLT4: Topic Modeling
Latent Dirichlet Allocation (à la Blei, Wallach, McCallum, Mimno, ...) [we use mallet]
Each document is a mixture of topics; each topic is a probability distribution over W.
mallet is available from http://mallet.cs.umass.edu/
HLT 4 : Topic Modeling j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron
HLT 4 : Topic Modeling M j.lovarato@enron e.haedicke@enron k.lay@enron r.shapiro@enron a.zipper@enron b.rogers@enron V\M
HLT4: Topic Modeling
Score all d_i for each vertex (e.g. k.lay@enron): sum the weight given to topics of interest → 0.12
Individual Analytic Performance [bar chart: MAP, 0.0–0.4, for Context, Average WCH, Minimum WCH, Compression, and Topic Modeling]
Outline: Introduction · Method · Importance Sampling · Evaluation · Analytics · Content and Context Fusions · Conclusions & Future Directions
Linear Fusion
F = γ · HLT1 + (1 − γ) · Context (2 analytics)
Linear fusion, 2 analytics [plot: MAP (peaking near 0.25) as a function of γ, from γ = 0 (Context only) to γ = 1 (Content only); grid search over the Content and Context weights]
Linear Fusion
F = γ · HLT1 + (1 − γ) · Context (2 analytics)
F = Σ_i γ_i · HLT_i + (1 − Σ_i γ_i) · Context (arbitrary number of analytics)
NB: Scores must be calibrated.
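For the two-analytic case, the grid search over γ can be sketched as (a minimal sketch; `evaluate` stands in for MAP or any other measure over the fused scores):

```python
def fuse(hlt, context, gamma):
    # F = gamma * HLT_1 + (1 - gamma) * Context, per vertex (scores must be calibrated).
    return [gamma * h + (1 - gamma) * c for h, c in zip(hlt, context)]

def grid_search(hlt, context, evaluate, steps=11):
    # Try gamma in {0, 0.1, ..., 1.0}; keep the weight whose fused scores
    # evaluate best. With n analytics the grid grows as steps**n.
    return max((g / (steps - 1) for g in range(steps)),
               key=lambda g: evaluate(fuse(hlt, context, g)))
```

The exponential growth of the grid with the number of analytics is exactly the scalability problem the rank fusions below sidestep.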
Grid-searched Linear Combinations [bar chart: MAP, MRR, p@5, and p@10; MAP axis 0.0–0.5]
Rank Fusion
Fuse ranks instead of scores: rank the vertices by each analytic, so each vertex is represented by a vector of ranks. The fusion score is a function of that vector:
Min(): one measure can damn you
Max(): one measure can save you
Median(): something in between
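The rank-fusion recipe above can be sketched as (a minimal sketch; higher scores are assumed better, and ties are broken arbitrarily):

```python
from statistics import median  # one of the combine options, alongside min and max

def rank_fusion(score_lists, combine=min):
    # Convert each analytic's scores to ranks (1 = best), then fuse each
    # vertex's rank vector with `combine` (min, max, or median).
    n = len(score_lists[0])
    rank_lists = []
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        ranks = [0] * n
        for r, i in enumerate(order, start=1):
            ranks[i] = r
        rank_lists.append(ranks)
    return [combine([ranks[i] for ranks in rank_lists]) for i in range(n)]
```

Note the cost is linear in the number of analytics: each analytic is sorted once, with no joint weight search.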
Rank Fusion [bar chart: MAP, 0.0–0.5, for min/median/max rank fusion with 3 analytics and with 5 analytics]
Overall Comparisons [bar chart: MAP, 0.0–0.5, for Individual analytics, Rank Fusions, and Gridsearch Linear Combinations]
Rank fusion is O(n) in the number of analytics; gridsearch over linear combinations is O(x^n)!!
Nevermind O(.) for a moment: in practice, ~11 minutes versus ~13 days.
Linear fusion, 3 analytics
Which HLTs should I use? [ternary plots over fusion weights on Context, Compression, and Other HLTs (Average WCH + Minimum WCH + Topic Modeling); one panel each for: need to find them all, need to find only one more, can only look at 5 or 10]
Outline: Introduction · Method · Importance Sampling · Evaluation · Analytics · Content and Context Fusions · Conclusions & Future Directions
Questions
How should we fuse? For a small number of analytics: grid-search linear combinations. For a large number of analytics: rank fusion (or think harder).
What kind of HLTs should we use? Depends on your inference task, of course. We can actually recommend what to use! Insight into the relative strengths of the HLTs is exposed.
Future and Related Work
This is post-hoc fusion of scores or ranks; what about more native fusion?
Different data and inference tasks: What movie should I watch? (Netflix) Whom should I cite? (ACL) Whom should I work with? (GitHub) Vote prediction (Congressional Bills)
Collaborators
Carey Priebe [JHU HLTCOE], Allen Gorin [JHU HLTCOE], Richard Cox [JHU HLTCOE], David Marchette [Navy], Yongser Park [JHU CIS], Minh Tang [JHU AMS], Michael Decerbo [Raytheon BBN], Hanna Wallach [UMass Amherst], Jim Mayfield [JHU APL/HLTCOE], Paul McNamee [JHU APL/HLTCOE], William Szewczyk [DoD]
Thank you.
Coppersmith & Priebe (submitted): [arxiv.org/1201.4118]
Latest papers (often) available from glencoppersmith.com
Content of Importance Sample
k.lay@enron: California Power, Portland, San Diego, WBI
e.haedicke@enron: California Power, Portland, bcf, WBI Energy Services, billions of cubic feet
j.lovarato@enron: California Power, San Diego, WBI, bcf
r.shapiro@enron: Pro Football, Raiders, touchdown, implode
b.rogers@enron, a.zipper@enron: 9/11, New York, tragedy
Context of Importance Sample
Observed s1 = 0.45; observed p1 = 0.13; observed p1 = 0.12
Content of Importance Sample [per-mailbox term samples with counts: k.lay@enron (0 + 12): Portland, San Diego, WBI; e.haedicke@enron (1 + 8): Portland, bcf, WBI Energy Services; j.lovarato@enron (5 + 31): San Diego, WBI, bcf]
Comparing HLT Methods [scatter plot: MRR (0.0–1.0) versus s1 (0.0–0.4), one plotting symbol per HLT method]
Importance Sampled Joint [scatter plot: ΔTopic (0.5–2.0) versus Δρ (0.0–0.5)]
Injection Importance [plots of p0 versus p1 (0.00–0.06) and s0 versus s1 (0.0–0.6)]