Cross Corpora Discovery Via Minimal Spanning Tree Exploration

1 Cross Corpora Discovery Via Minimal Spanning Tree
Jeff Solka, Edward J. Wegman, and Avory Bryant
5/28/2004

2 In a Nutshell
o What are we trying to do? Develop a semi-automated system to facilitate the discovery of articles from disparate corpora that may contain subtle relationships.
o What is our approach predicated on? The synthesis of methodologies from statistics, mathematics, and visualization.
o What is our test case? Roughly 1200 Science News abstracts that have been precategorized into 8 categories.

3 The Science News Corpus
o 1117 documents from
o Obtained from the SN website on December ,2002 using wget.
o Each article ranges from half a page to roughly a page in length.
o The corpus HTML/XML code was subsequently parsed into plain text.
o The corpus was read through and categorized into 8 categories.

4 The Science News Corpus Breakdown
o Anthropology and Archeology (48).
o Astronomy and Space Sciences (124).
o Behavior (88).
o Earth and Environmental Sciences (164).
o Life Sciences (174).
o Mathematics and Computers (65).
o Medical Sciences (10).
o Physical Sciences and Technology (144).

5 Our Approach to be Discussed Today (5/26/04)
o Multi-Discipline Document Set
o Feature Extraction (Denoising, stemming, BPM, TPM)
o Interpoint Distance Calculation
o Minimal Spanning Tree (MST) Calculation
o MST Layout Via Spring Based Models
o Cross Corpora Discovery Via MST

6 Denoising and Stemming
o These steps are performed prior to the subsequent feature extraction steps.
o Denoising consists of removing all words that appear on a stopper (noise word) list, e.g. the, a, an, ...
o Stemming transforms a given word into its base form, e.g. walking -> walk, walked -> walk.
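As a rough illustration of these two preprocessing steps, here is a minimal Python sketch. The stop-word list is an illustrative assumption (the authors' list is not given), and `naive_stem` is a toy suffix stripper standing in for a real stemmer, not the Porter algorithm.

```python
import re

# Illustrative noise-word list (assumption; the authors' list is not given).
STOP_WORDS = {"the", "a", "an", "in", "his", "of", "and"}

def naive_stem(word):
    """Strip a few common suffixes. A toy stand-in, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def denoise_and_stem(text):
    """Lowercase, tokenize on letters, drop noise words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(denoise_and_stem("The man walked in the crowd"))
```

In practice a full stemmer would replace `naive_stem`, since simple suffix stripping mishandles many English words.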

7 Net Algorithmic Complexity
o Let p be equal to the average number of word pairs or triplets in each document.
o The net algorithmic complexity is $O(n^2 p)$.
o It would be easy to formulate parallel computation strategies in both n and p.
o Note that the computational complexity associated with the actual calculation of the BPM is not included here.

8 Feature Extraction: Bigram Proximity Matrix (BPM) and Trigram Proximity Matrix (TPM)
Example sentence: The wise young man sought his father in the crowd.
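To make the idea concrete, the sketch below counts adjacent word pairs and triplets for the example sentence, storing them sparsely as a mapping from pair (or triplet) to count rather than as a full matrix. The tokenization, and the choice to skip denoising and stemming here, are assumptions made for illustration; the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def bigram_counts(text):
    """Sparse BPM: (word_i, word_j) -> number of times word_j follows word_i."""
    words = tokenize(text)
    return Counter(zip(words, words[1:]))

def trigram_counts(text):
    """Sparse TPM: (word_i, word_j, word_k) -> count of that adjacent triplet."""
    words = tokenize(text)
    return Counter(zip(words, words[1:], words[2:]))

sentence = "The wise young man sought his father in the crowd."
bpm = bigram_counts(sentence)
print(len(bpm), bpm[("wise", "young")])
```

Word order matters: `("wise", "young")` is recorded but `("young", "wise")` is not, which is what lets the representation retain more than a bag of words would.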

9 Evidence That BPM and TPM Capture Semantic Content
o Angel Martinez, A Framework for the Representation of Semantics, Ph.D. dissertation under the direction of Edward Wegman, October
  o Supervised Learning.
  o Hypothesis Tests ( sets of tests).
  o Unsupervised Learning.
  o Supervised Learning in a Reduced Dimension Space.

10 Similarity Measures and Pseudometrics on the BPM
o Following Martinez (2002) we propose the use of the Ochiai measure in the case of the BPM:
$$S(X, Y) = \frac{|X \text{ and } Y|}{\sqrt{|X|\,|Y|}}$$
o This is converted to a distance via:
$$d(X, Y) = \sqrt{2 - 2\,S(X, Y)}$$
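A short worked example may help. It assumes the Ochiai measure in its usual form, with $|X|$ the number of distinct word pairs in article X and $|X \text{ and } Y|$ the number of pairs the two articles share. The counts used (4, 9, and 3) are hypothetical.

```latex
% Hypothetical counts: |X| = 4, |Y| = 9, |X and Y| = 3
S(X, Y) = \frac{|X \text{ and } Y|}{\sqrt{|X|\,|Y|}}
        = \frac{3}{\sqrt{4 \cdot 9}} = \frac{3}{6} = 0.5,
\qquad
d(X, Y) = \sqrt{2 - 2\,S(X, Y)} = \sqrt{2 - 1} = 1.

% Extremes: identical pair sets give S = 1 and d = 0;
% disjoint pair sets give S = 0 and d = \sqrt{2}.
```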

11 Interpoint Distance Complexity Issues
o Let n be the number of documents in the corpus. The interpoint distance matrix involves n(n-1)/2 comparisons, which results in an $O(n^2)$ operation in the number of documents n.
o It will pay to make each of these comparisons as efficient as possible.
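The n(n-1)/2 count can be seen directly in a sketch that fills a symmetric matrix by visiting each unordered pair once. The function name and the stand-in distance used in the demo are assumptions for illustration only.

```python
from itertools import combinations

def interpoint_distances(docs, distance):
    """Fill a symmetric n x n matrix, calling `distance` once per unordered pair."""
    n = len(docs)
    D = [[0.0] * n for _ in range(n)]
    comparisons = 0
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = distance(docs[i], docs[j])
        comparisons += 1
    return D, comparisons

# Demo with numbers standing in for documents and |a - b| as the distance.
D, count = interpoint_distances([1, 4, 6, 10], lambda a, b: abs(a - b))
print(count)  # 4 documents -> 4 * 3 / 2 pairs
```

Because each pair is independent of the others, the loop body is also the natural unit to distribute if the computation is parallelized over n.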

12 Similarity and Distance Measure Complexities
o Let x be the set of word pairs or triplets in Article X, and let y be the set of word pairs or triplets in Article Y. Then "Article X AND Article Y" can be described as the intersection of the sets x and y.
o The sets x and y are represented as hash tables in which the key (a word pair or triplet) maps to the number of occurrences in the article.
o The size of the intersection of x and y can then be computed as the number of keys in x that are also in y. The containsKey function being used is close to O(1), so the computation of |X and Y| should be close to O(size of x): for each key in x, check whether y contains that key.
o The value of |X| |Y| is (size of x) * (size of y).
$$S(X, Y) = \frac{|X \text{ and } Y|}{\sqrt{|X|\,|Y|}}$$
$$d(X, Y) = \sqrt{2 - 2\,S(X, Y)}$$
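The hash-table computation described on this slide might look roughly as follows, with Python dicts playing the role of the hash tables and `key in y` playing the role of the contains-key lookup. The function names are hypothetical, and returning 0 similarity for an empty article is an assumption.

```python
import math

def ochiai_similarity(x, y):
    """x, y: dicts mapping a word pair (or triplet) to its occurrence count."""
    if not x or not y:
        return 0.0  # assumption: an empty article is similar to nothing
    # One O(1)-ish membership test per key of x, so roughly O(size of x).
    shared = sum(1 for key in x if key in y)
    return shared / math.sqrt(len(x) * len(y))

def ochiai_distance(x, y):
    """d = sqrt(2 - 2S); max() guards against tiny negative rounding error."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * ochiai_similarity(x, y)))

# Single letters stand in for word pairs: 4 keys, 9 keys, 3 shared.
x = {k: 1 for k in "abcd"}
y = {k: 1 for k in "abcefghij"}
print(ochiai_similarity(x, y), ochiai_distance(x, y))
```

Only the keys are used, not the stored counts, which matches the slide's description of the intersection as a count of shared keys.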

13 What Do We Have at This Point in the Process?
o An interpoint distance matrix between each of the articles.
  o Hopefully articles that are close semantically will be close in this matrix.
  o Hopefully articles that are far apart semantically will be far apart in this matrix.
o We also have a previously obtained categorization of the articles, obtained via:
  o Human feedback.
  o Automatic process.

14 How Do We Exploit This Interpoint Distance Matrix?
o First order exploitation. Look for the closest points between each pair of categories (corpora).
o Second order exploitation. Look for those articles that are along the boundary that separates the two categories.
o Third order exploitation. Look for those articles that have the same relationship to the discriminant boundary.
o Fourth order exploitation. Allow the user to drive the interpoint distance geometry via identification of first, second, and third order interesting relationships and subsequent regeometrization.
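First order exploitation is the simplest of the four to sketch: given the distance matrix and the category labels, rank the article pairs that straddle two chosen categories by distance. The function name, the tiny matrix, and the labels below are all hypothetical.

```python
def closest_cross_pairs(D, labels, cat_a, cat_b, k=1):
    """Return the k closest (distance, i, j) pairs with i in cat_a and j in cat_b."""
    pairs = [
        (D[i][j], i, j)
        for i, label_i in enumerate(labels) if label_i == cat_a
        for j, label_j in enumerate(labels) if label_j == cat_b
    ]
    return sorted(pairs)[:k]

# Hypothetical 4-article corpus: two articles in each of two categories.
D = [
    [0, 1, 5, 2],
    [1, 0, 4, 3],
    [5, 4, 0, 1],
    [2, 3, 1, 0],
]
labels = ["anthropology", "anthropology", "behavior", "behavior"]
print(closest_cross_pairs(D, labels, "anthropology", "behavior", k=2))
```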

15 The Minimal Spanning Tree (MST): A Strategy for Effective Exploitation of the Interpoint Distance Matrix
o Definition (Minimal Spanning Tree (MST)): the collection of edges that joins all of the points in a set together with the minimum possible sum of edge values.
o The edge values used here are the distances stored in our interpoint distance matrix.
Figure: a complete graph and its associated MST.

16 Calculation of the MST: Kruskal's Algorithm
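For readers without the slide figures, Kruskal's algorithm can be sketched in a few lines: sort the edges by weight and accept each edge that joins two different components, tracked here with a simple union-find. This is a generic Python sketch, not the authors' Java implementation; the edge format `(weight, u, v)` is an assumption.

```python
def kruskal_mst(n, edges):
    """edges: iterable of (weight, u, v) with 0 <= u, v < n. Returns the MST edges."""
    parent = list(range(n))

    def find(a):
        # Follow parent links to the component root, halving the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    mst = []
    for w, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            mst.append((w, u, v))
            if len(mst) == n - 1:     # a spanning tree has n - 1 edges
                break
    return mst

# Complete graph on 4 hypothetical articles, weights taken from a distance matrix.
edges = [(1, 0, 1), (5, 0, 2), (2, 0, 3), (4, 1, 2), (3, 1, 3), (1, 2, 3)]
print(kruskal_mst(4, edges))
```

On a complete graph built from the interpoint distance matrix there are n(n-1)/2 edges, so the sort dominates the running time.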

24 MST Classifier Complexity Characterization
Previous work had suggested that the number of cross class edges can be used as a surrogate for classification complexity. These cross class (corpora) edges will be used in our scheme to facilitate the cross-corpora discovery process.
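Extracting the cross class edges from an MST is a simple filter once each node carries a category label. The sketch below assumes MST edges in `(weight, u, v)` form and hypothetical labels; it is an illustration, not the authors' code.

```python
def cross_class_edges(mst_edges, labels):
    """Keep the MST edges whose two endpoints carry different category labels."""
    return [(w, u, v) for w, u, v in mst_edges if labels[u] != labels[v]]

# Hypothetical MST over 4 articles, two per category.
mst = [(1, 0, 1), (1, 2, 3), (2, 0, 3)]
labels = ["anthropology", "anthropology", "behavior", "behavior"]

crossing = cross_class_edges(mst, labels)
print(crossing, len(crossing) / len(mst))
```

The articles at the ends of these edges are the natural candidates to show a user first, and the fraction of crossing edges gives the complexity surrogate mentioned above.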

25 Implementation Issues (The Devil in the Details)
o BPM extraction and interpoint distance calculation: Implemented in C#.
o BPM similarity and distance calculation: Implemented in C#.
o MST calculation: Implemented using Kruskal's algorithm in Java.
o Visualization environment: Implemented in Java. Graph layout facilitated using TouchGraph.

26 TouchGraph
o TouchGraph is a general public license Java-based library for the visualization of graphs.
o Graph layout in TouchGraph: When a graph is first loaded, nodes start out at the center with slightly random positions, and then spread out because of node-node repulsions.
o Graph manipulation tools provided by TouchGraph:
  o Zooming.
  o Rotation.
  o Hyperbolic manipulation.
  o Graph dragging.

27 The Environment (Opening Screen)

28 The Environment (MST)
Blue is anthropology and archaeology. Pink is behavior.

29 The Environment (The Comparison File)

30 The Demo

31 Wrap-up
o Demonstrated a new method for cross corpora document discovery.
o Method predicated on the use of the BPM and the MST as a convenient foil for the exploration of cross corpora relationships.
o This work represents the tip of the iceberg of a new area that is not only of strategic importance to the United States but is also highly relevant to all who are currently conducting research in any discipline.

32 Backup Slides

33 An Alternate Approach
o Multi-Discipline Document Set
o Exemplar Term Production Via Synonym Analysis
o Feature Extraction (BPM, TPM)
o Dimensionality Reduction ISOMAP/LLE
o Model-Based Clustering With Adaptive Mixtures Initialization
o Serendipity Identification and Visualization

34 A Paradigm
"you don't reach Serendip by plotting a course for it. You have to set out in good faith for elsewhere and lose your bearings serendipitously."
-- John Barth, The Last Voyage of Somebody the Sailor

35 Acknowledgements
o Jim Gentle (Opportunity to speak)
o Algotek (Funding and Program Management)
  o Anna Tsao
o Algotek Team (Helpful discussions and encouragement)
  o Carey Priebe
  o David Marchette

36 The Porter Stemming Algorithm
o The Porter stemming algorithm (or "Porter stemmer") is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up Information Retrieval systems. (Quoted from the official home page for distribution of the Porter Stemming Algorithm.)
