Preliminary Experiments on Literature Based Discovery using the Semantic Vectors Package

M. Heidi McClure
Intelligent Software Solutions, Inc
Colorado Springs, CO

Abstract

This paper presents a literature based discovery (LBD) implementation that uses Lucene for indexing, the Semantic Vectors (SV) package for latent semantic analysis, Neo4j for graph database storage, and Gephi for visual representation, along with custom code written by the author. Using a latent semantic analysis based system like SV to do LBD is not new, but taking the next steps of examining related concepts and using a graph database representation to find candidate linking terms is. The LBD system is a framework in which relation extraction experiments may be performed. This paper presents work that is in progress.

Keywords: literature based discovery, semantic vectors package, relation extraction

1. Introduction

Literature based discovery (LBD) has been around since the 1980s, when D. R. Swanson first defined LBD as a means to discover previously unknown knowledge by examining term occurrences across multiple documents [16]. This paper presents an implementation that uses variations on latent semantic analysis (LSA) to perform LBD-style discovery. Using LSA to do LBD has been studied by various authors, among them [14], [11], [19]. LSA discovers semantically related concepts, giving higher correlation scores to more strongly related concepts. When a pair of concepts has a significant semantic relatedness score from LSA and the two concepts are never mentioned in the same document, the pair is an LBD candidate pair. The other candidate pairs, those whose concepts are mentioned together in one or more documents, may also provide important information. These two sets - LBD candidate pairs and share-document-mention (SDM) pairs - present data that allows a next level of discovery, which is the discovery of candidate linking terms.
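The split into LBD candidate pairs and SDM pairs can be illustrated with a minimal sketch. The relatedness scores, the document sets and the 0.3 threshold below are hypothetical values chosen for illustration; they are not taken from the system described in this paper.

```python
def classify_pairs(scores, docs_by_concept, threshold=0.3):
    """Split related concept pairs into LBD candidates and SDM pairs.

    scores: {(concept_a, concept_c): relatedness score}
    docs_by_concept: {concept: set of document ids mentioning it}
    threshold: hypothetical cut-off for a "significant" score
    """
    lbd, sdm = [], []
    for (a, c), score in scores.items():
        if score < threshold:
            continue  # not related strongly enough to be a candidate
        shared = docs_by_concept.get(a, set()) & docs_by_concept.get(c, set())
        # No shared documents -> LBD candidate; otherwise an SDM pair.
        (sdm if shared else lbd).append((a, c, score))
    return lbd, sdm


# Toy data (made up for illustration).
docs_by_concept = {
    "fish_oil": {"d1", "d2"},
    "raynauds": {"d3", "d4"},
    "blood_viscosity": {"d2", "d3"},
}
scores = {
    ("fish_oil", "raynauds"): 0.42,
    ("fish_oil", "blood_viscosity"): 0.61,
    ("blood_viscosity", "raynauds"): 0.55,
    ("fish_oil", "aspirin"): 0.05,
}
lbd_pairs, sdm_pairs = classify_pairs(scores, docs_by_concept)
```

In this toy data the only LBD candidate is the fish oil / Raynaud's pair, because the two concepts never share a document even though their score clears the threshold.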
Using graph database navigation across all related concepts found with LSA to discover new linking-concept candidates for an LBD candidate pair is the new approach presented in this paper.

The system presented in this paper handles three tasks: corpus generation, candidate related entities discovery and relationship analysis. Corpus generation retrieves document data from databases or from text files (including XML). All data is pre-processed by applying various normalization tasks. The pre-processed data is placed on the filesystem, with each document in a separate text file. The candidate related entities discovery task indexes the data, performs forms of LSA and identifies the concept pairs with the highest relatedness scores. Relationship analysis is the final task of identifying the candidate linking concepts. This is done by first retrieving all documents relating to a candidate pair and determining which pairs are LBD candidate pairs and which are SDM pairs. Each concept-to-concept link is then captured in a graph database as either an LBD or an SDM pair. Last, candidate linking terms are discovered by traversing the graph database. This work is in progress and will provide the framework on which more advanced relationship discovery will be performed.

The rest of this paper presents background information on LBD, LSA, and Random and Reflective Random Indexing (RI and RRI), which are enhancements to LSA. It presents some details of the software components and data used in the design and test of the system, including the Semantic Vectors (SV) package, Lucene, Neo4j, Gephi and the MEDLINE data set. It presents the design of the system and a discussion of the results. The end of this paper presents some conclusions and ideas for the future directions of this work.

2. Background

2.1 Literature Based Discovery

LBD is the discovery of hidden knowledge in large sets of documents where the discoveries relate concepts A and C together.
In LBD, a single document in the corpus will not contain the discovery. Sometimes a linking term, B, may be the means by which A and C are discovered and B would be in all documents containing A or C. In statistical approaches to LBD, there may not be a linking B term in the discovery - instead, A and C are discovered by semantic relatedness of the documents using, for example, latent semantic analysis (LSA) techniques. Once candidate discoveries are found, experiments may be performed to prove or disprove the hypotheses.

As various works state, Don R. Swanson [16] is considered to be the first to have mentioned the LBD style of discovery [13], [10], [15]. Swanson manually reviewed medical journals in one of his studies and found mentions of Raynaud's syndrome and its relation to various blood issues like problems with blood

viscosity, platelet function and vascular reactivity. Then he looked for discussions of these blood issues in articles that did not mention Raynaud's and found associations between the blood problems and fish oil. The connection made is that, perhaps, Raynaud's syndrome could be treated with fish oil. Medical studies have since been able to validate this discovery.

Since his initial study, Swanson and many other researchers started to apply automation to the task of LBD. Initially, works would look for candidate B linking terms simply by their co-occurrence in the A-term documents, then use these B terms to find candidate documents where A is not mentioned. They would then try to find, again with co-occurrence, candidate C terms in the new set of documents. They sometimes used lists of words (vocabularies) to restrict which words they considered in each search - when a list is used for both A and C concepts, the LBD system is called a closed system. An open system considers a list of words only for the starting concept, A [10]. This initial work is a closed LBD work, but future work plans to use the open approach.

2.2 Latent Semantic Analysis

LSA came initially from the Latent Semantic Indexing (LSI) work done by Scott Deerwester, Susan Dumais and others [9] and was used in LBD by Michael Gordon and Dumais [11]. LSA does not necessarily require a vocabulary but, instead, finds similar documents based on LSI. LSA assumes that if terms or concepts are found in similar sets of text (not always the same text), then these terms or concepts may be related, or may be the same or similar concepts. The mathematics behind LSI uses singular value decomposition (SVD) to reduce the dimensions of extremely large matrices by getting rid of less interesting data and to discover the related terms in documents. LSI proved to be more efficient than previous methods and has been moderately successful; however, it's still slow and computationally expensive.
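The SVD-based reduction described above can be illustrated with a small sketch. It assumes NumPy is available; the toy term-by-document counts, the term names and the choice of k = 2 dimensions are all made up for illustration and do not come from the experiments in this paper.

```python
import numpy as np

# Toy term-by-document count matrix (rows: terms, columns: documents).
terms = ["fish_oil", "blood_viscosity", "raynauds", "magnesium"]
counts = np.array([
    [2, 1, 0, 0, 0],   # fish_oil
    [1, 2, 1, 1, 0],   # blood_viscosity
    [0, 0, 2, 1, 0],   # raynauds
    [0, 0, 0, 0, 3],   # magnesium
], dtype=float)


def lsa_term_vectors(matrix, k):
    """Project each term onto the k strongest singular directions."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(x, y):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0


vectors = lsa_term_vectors(counts, k=2)

# fish_oil and raynauds never share a document, so their raw rows are orthogonal.
sim_raw = cosine(counts[0], counts[2])
# In the reduced space they are pulled together through blood_viscosity.
sim_latent = cosine(vectors[0], vectors[2])
# magnesium lives in an unrelated document and stays unrelated.
sim_unrelated = cosine(vectors[0], vectors[3])
```

The point of the sketch is only that two terms with no shared documents can still receive a high relatedness score after dimension reduction, which is the property that makes LSA usable for LBD.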
2.3 Random and Reflective Random Indexing

More recently, Cohen, et al. [8] experimented with random indexing (RI) - a more scalable version of LSI - and extended the RI concepts to support indirect inference. Indirect inference is what Cohen, et al. sometimes call LBD. RI uses a random approach to further reduce the size of the matrices being analyzed to discover similar terms in documents. Instead of a full term by document matrix, documents are placed into small sets of columns. For example, if there are 10,000 documents, a document may be assigned to 20 randomly chosen columns. Each document's term frequency information is tallied in each of its columns along with that of any other document that was randomly assigned to the same column [12]. Cohen, et al. also experimented with variations of RI - sliding windows on RI, term based Reflective Random Indexing (RRI), and document based RRI. RRI takes the results of one RI pass and feeds them into another pass of indexing. Term and document based RRI vary how the random indexing is chosen - by term or by document in various passes through the RRI. Their claim is that these techniques provide more related terms/concepts that may not co-occur in the same document but are possibly related. They state that their use of RRI techniques is better suited for LBD.

Fig. 1: Architecture

3. Design

This section describes the general design of the SV-based LBD system. Multiple steps are performed that ultimately present pairs of entities that may be related, along with candidate linking terms. The general architecture of the system is shown in Figure 1. The steps noted below align with the small boxes in the architecture diagram.

1) Retrieve data and place each document into a separate text file on the computer filesystem. The data tested so far has been in databases as reports summarizing, for example, news articles, as web pages or as other text data.
Data also has been in databases as copies of emails and has been in large XML files which needed to be broken up to get text documents. An example of XML files is the MEDLINE data presented in this paper, where the resulting document pulled from the XML is just the abstract of the article.

2) Pre-process the corpus, mostly to identify the concepts we wish to analyze. This makes a copy of the original filesystem documents and tags the new documents as necessary. Today this consists of some simple normalization of the data, turning John Doe into johnxxxdoe so that the next step of indexing doesn't need custom indexing capabilities. In other words, multi-word tokens are turned into single words. Lowercase is used because the SV package works best with all concepts in lower case.

3) Index the corpus using Lucene indexing as described in the semantic vectors package. The SV package uses the Lucene index to help build the SV vector files. See [6] for more information on Lucene.

4) Apply semantic vectors processing to create document and term vector files. Term vectors are not simple word

count per document vectors. They are vectors where the entries represent a term's relatedness to random sets of documents. Comparing the vectors for two different terms (i.e. concepts) produces a score that gives the LSA relatedness of the two terms.

5) Candidate related entity identification is performed by comparing the term vectors for each entity as discussed in the previous step. Some term pairs come up with a 1.0 score, indicating the strongest relatedness - since there's probably nothing new to discover with such a pair, these terms are ignored in LBD processing. Similarly, term pairs with zero or negative scores are also ignored.

6) Retrieve documents for the related entities. Finding these documents is done by going back to the original Lucene index created before SV processing. Returned are documents that mention either of the entities or both. At this time, identification of LBD candidates is done by finding the pairs of related entities that are never mentioned in the same documents.

7) Relationship identification involves
a) examining documents where entities appear together, and
b) when entities are LBD candidates, identifying candidate linking B terms. This may be done by navigating the graph looking for terms that are linked to both the A and the C terms. There are examples of this presented in the results section.
Explaining why entities may be related is the primary area of future work planned by the author.

In the next subsections, a small amount of detail is provided about some of the tools and APIs used in this work.

3.1 Semantic Vectors Package

A byproduct of Cohen's work to improve LSI [8] is the Semantic Vectors (SV) package. This open source Java software was initially written by Cohen's co-author, Dominic Widdows, and is now maintained by him and many others as a Google Code project [18]. SV provides a library of capabilities that perform random indexing, which performs much faster than SVD.
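The normalization of step 2 and the term-vector comparison of steps 4 and 5 can be sketched in a few lines of plain Python. This is an illustration of the random indexing idea only, not the Semantic Vectors API: the function names, the signed (+1/-1) entries, the vector dimension of 64 and the 8 non-zero columns per document are all assumptions made for this sketch.

```python
import math
import random
from collections import Counter


def to_token(phrase):
    """Step 2 style normalization: 'John Doe' -> 'johnxxxdoe'."""
    return "xxx".join(phrase.lower().split())


def random_index_vectors(doc_ids, dim, nonzero, rng):
    """Assign each document a few randomly chosen signed columns."""
    return {
        doc_id: [(col, rng.choice((-1, 1))) for col in rng.sample(range(dim), nonzero)]
        for doc_id in doc_ids
    }


def term_vectors(docs, dim=64, nonzero=8, seed=0):
    """Tally each term's frequency into the columns of the documents it occurs in."""
    rng = random.Random(seed)
    index = random_index_vectors(sorted(docs), dim, nonzero, rng)
    vectors = {}
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            vec = vectors.setdefault(term, [0.0] * dim)
            for col, sign in index[doc_id]:
                vec[col] += sign * freq
    return vectors


def cosine(x, y):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Toy corpus (made up): each document is a list of normalized tokens.
docs = {
    "d1": [to_token("fish oil"), "blood", "viscosity"],
    "d2": ["raynauds", "blood", "viscosity"],
    "d3": ["magnesium", "migraine"],
}
vectors = term_vectors(docs)
score = cosine(vectors["magnesium"], vectors["migraine"])
```

Terms that occur in exactly the same documents, such as magnesium and migraine here, end up with identical vectors and a score of 1.0, which is the kind of pair step 5 discards as having nothing new to discover.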
SVD is an N x N problem, and the matrices can grow to a size that current computing capabilities cannot handle. RI can do LSA-like analysis on millions of documents. For this work, terms are compared to identify relatedness.

3.2 Lucene indexing

Lucene [6] is a set of Java libraries that, among other things, allows for the searching of terms and phrases in sets of text documents or other representations of text like PDFs, HTML, Microsoft Word, etc. Lucene creates index files that contain the information necessary not only to quickly find terms or phrases contained in a corpus, but also to indicate where in the document the terms or phrases are. Lucene provides fast and efficient search capabilities. In the LBD solution presented here, Lucene indexes are created first and then the Semantic Vectors package is used to find candidate LBD pairs. Once pairs are found, the Lucene indexes are again referenced to find the documents in which entities are mentioned.

3.3 Graph database - Neo4j

Running SV with term comparisons across a set of documents discovers pairs of concepts that may be related. A graph database that represents node X, node Y and the link joining node X and Y is a good choice for storing results of SV or any other results of latent semantic analysis. Graph databases are sometimes called NoSQL databases since data is not stored in a traditional table structure. Neo4j is an open source lightweight graph database [7]. For this project, nodes are the concepts (for example, A, B and C concepts) and the links are either an LBD link, where the nodes on either side of the link are never mentioned in the same document, or a shared document link, where the nodes on either side are both mentioned in one or more documents.

3.4 Visualization - Gephi

Once data is stored in a Neo4j database, visualizing that data is important in order to assist in the analysis of results. Gephi is an open source graph visualization tool [5].
Martin Skurla created a Neo4j plugin for Gephi during Google's Summer of Code [4]. Once a Neo4j database is loaded into Gephi, you can show only LBD relationships (links) or show only shared common document relationships. The graph (link) pictures shown in this paper were created from snapshots of Gephi displays of Neo4j databases.

4. Experimentation and Results

4.1 Data Sets

The current application allows flexibility in which steps of the LBD process are run and in what data are used. For example, the MEDLINE data used in the experiments presented here are pulled out of XML files using Python scripts that place each abstract into a separate file on the filesystem. Data may also come from, for example, an Enron corpus that may be in a database or may just be files on a filesystem [1], [3]. When data comes from database queries, the document data usually is read from large text fields or sets of fields - the system presented here places this data onto the filesystem, one file per document.

MEDLINE is a National Library of Medicine database of biomedical literature that includes, among other things, the

titles and abstracts of over 21 million articles from over 5000 different journals and publications [2]. This data is stored in many XML files. For this effort, the Python script pulled each abstract for the years of interest and placed each into a file on the filesystem. Initial experimentation used all the abstracts from which simulates rough dates used by Don Swanson in his initial LBD work [16]. This date range created 692,382 total file documents.

Fig. 2: Sample of related pairs
Fig. 3: Sample LBD Pairs
Fig. 4: Sample Candidate B Terms

4.2 Output

MEDLINE data was analyzed by the system using approximately 190 candidate A and C concepts that were gathered from two papers - one that discussed fish oil and Raynaud's [11] and the other that discussed both the fish oil - Raynaud's connection and the migraine - magnesium connection [17]. During the processing of results, a Neo4j database was created that ultimately allows for visual display of results. The Neo4j database contains nodes that represent concepts and links that are named either an LBD relationship or a share-document-mention (SDM) relationship. If the relationship between two nodes is an LBD relationship, that means the nodes were never mentioned together in the same document.

The system also creates summary results in a set of text output files. In order to keep the number of result pairs to a manageable size, an algorithm is currently used that returns either a fixed number of results or a percentage of the total pairs discovered. Some of the generated output files contain:

1) All related pairs - pairs that share mentions in documents and those that don't. The ones that don't will be LBD pairs. The SV score is also captured for each pair. See Figure 2.

2) All LBD pairs - in order to discover the LBD pairs, the Lucene index is queried to find all documents in which the terms are mentioned. If there are no documents in common (bothaandcdocs = 0 in the LBD Pairs output), then this is an LBD pair.
For preliminary work, only a maximum of 100 document matches is allowed for this result reporting. See Figure 3.

3) All candidate linking terms - this file is the result of navigating the Neo4j graph database to find candidate B linking terms between the LBD pairs reported in the second list (All LBD Pairs). The first two terms are the LBD pair, the last is the candidate B term. See Figure 4.
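The third output, candidate linking terms, can be illustrated with a small in-memory stand-in for the graph database. This sketch does not use Neo4j; the dictionary-based graph, the function names, and the assumption that a B term is any concept with an SDM link to both members of an LBD pair are choices made for illustration. The toy pairs mirror the concepts shown in the figures of this paper.

```python
from collections import defaultdict


def build_graph(lbd_pairs, sdm_pairs):
    """Store undirected LBD and SDM links between concept nodes."""
    graph = defaultdict(lambda: {"LBD": set(), "SDM": set()})
    for kind, pairs in (("LBD", lbd_pairs), ("SDM", sdm_pairs)):
        for a, c in pairs:
            graph[a][kind].add(c)
            graph[c][kind].add(a)
    return graph


def candidate_linking_terms(graph, a, c):
    """Concepts that share documents with both A and C (assumed B-term rule)."""
    return sorted(graph[a]["SDM"] & graph[c]["SDM"])


lbd_pairs = [
    ("lupus", "raynauds"),
    ("selective beta blockers", "raynauds"),
    ("fish oil", "raynauds"),
]
sdm_pairs = [
    ("lupus", "prostacyclin"),
    ("prostacyclin", "raynauds"),
    ("selective beta blockers", "adrenergic alpha receptor"),
    ("adrenergic alpha receptor", "raynauds"),
]
graph = build_graph(lbd_pairs, sdm_pairs)

# One output row per (A, C, B): the LBD pair followed by the candidate B term.
rows = [
    (a, c, b)
    for a, c in lbd_pairs
    for b in candidate_linking_terms(graph, a, c)
]
```

With this toy graph, the fish oil / Raynaud's pair yields no row because no concept has an SDM link to both of them.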

Fig. 5: Highlight Raynaud's and see LBD candidate pairs
Fig. 6: Terms that share document references - A and C concepts, Lupus and Raynaud's, linked by B concept, Prostacyclin

4.3 Analysis

In addition to the text file results, the Neo4j database loaded into Gephi, the graph visualization tool, provides help in visualizing what's in the data. In Figure 5, all of the medical concepts that had an LBD relationship with Raynaud's phenomenon are highlighted. Simply mousing over the Raynaud's node performs this highlighting. From this visualization, we see that Raynaud's phenomenon is related, via LBD, to Lupus, selective beta blockers and fish oil - again, it's LBD because the links shown are only the LBD candidate pairs. As in Swanson's work [16], fish oil and Raynaud's phenomenon are never mentioned in the same documents.

In Figure 6, only links where the entities actually are mentioned in the same documents are shown. This particular view, where prostacyclin is highlighted, shows that prostacyclin may be a linking B concept between the A and C concepts - Lupus and Raynaud's phenomenon. In Figure 7, we can see that adrenergic alpha receptor is a concept relating selective beta blockers and Raynaud's phenomenon together. In this same diagram, we notice that fish oil had no candidate linking terms to Raynaud's - this is probably due to the small subset of A and C concepts analyzed in this early testing (as noted earlier, approximately 190 terms were used for candidate A and C concepts).

5. Conclusion

I have presented an approach to discovering hidden knowledge in documents using a latent semantic analysis variant from the semantic vectors package. My approach discovers candidate

A and C concepts or terms which, although never mentioned in the same document, may be related. I have also discovered candidate linking or B terms that relate the A and C concepts. To the best of my knowledge and research, identifying the linking B terms from the LSA results is something that has not previously been done. This system provides the platform on which alternative approaches may be tried to improve the quality of the discovered pairs.

Fig. 7: Terms that share document references - A and C concepts, selective beta blockers and Raynaud's, linked by B concept, adrenergic alpha receptor. Note that no candidate linking term was found between Raynaud's and fish oil.

6. Future

I hope to complete the following major tasks.

6.1 Complete the Reproduction of Swanson Results

To test this system, I have shown that fish oil and Raynaud's are semantically related and are never mentioned in the same documents in the corpus. Using the evaluation methodology described by Meliha Yetisgen-Yildiz and Wanda Pratt [20], the next step is to examine the corpus of documents after 1984 to see if the LBD candidates are mentioned in the same documents, thus, perhaps, showing that there is now a confirmed and important relationship between the LBD candidate pairs. There are other published results based on MEDLINE that use LBD to discover previously unknown related concepts. I may try to recreate those discoveries as well.

6.2 Making Operational Solution

I plan to examine how analysts, scientists or other users might use the capabilities of the system presented in this paper. If I can refine the system so that fewer false positives are presented, users may find great benefit in the kind of discoveries that LBD presents - that is, the previously unknown and hidden knowledge that is in existing data.
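The check planned in Section 6.1, whether an LBD candidate pair found in the earlier corpus is later mentioned together, can be sketched as a simple co-mention test. This covers only that co-mention step and is not a full implementation of the methodology in [20]; the function name and the toy later-period documents are hypothetical.

```python
def confirmed_pairs(lbd_pairs, later_docs):
    """Return LBD pairs whose two concepts co-occur in some later document.

    lbd_pairs: iterable of (concept_a, concept_c) found in the earlier corpus
    later_docs: {doc_id: set of normalized concept tokens} from the later corpus
    """
    confirmed = []
    for a, c in lbd_pairs:
        if any(a in tokens and c in tokens for tokens in later_docs.values()):
            confirmed.append((a, c))
    return confirmed


# Toy later-period corpus (made up for illustration).
later_docs = {
    "n1": {"fishxxxoil", "raynauds", "treatment"},
    "n2": {"lupus", "prostacyclin"},
}
candidates = [("fishxxxoil", "raynauds"), ("lupus", "raynauds")]
confirmed = confirmed_pairs(candidates, later_docs)
# Fraction of candidates that later appear together.
confirmed_fraction = len(confirmed) / len(candidates)
```

A low confirmed fraction does not by itself show that a candidate is wrong, since a true relationship may simply not have been published yet.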
6.3 Relationship Extraction - Discovering the Why

I plan to take the results - the candidate LBD pairs and perhaps the linking B terms - and try to figure out why the concepts are related. That is, try to explain the relations. Initially, I plan to try relationship extraction approaches that have already been published, with the hope of refining these techniques to fit this data and to provide more accurate results.

6.4 Open LBD

The approach I have taken to date is a closed LBD solution where we know what A and C concepts to study. The next step is to expand this work to perform open LBD, where the C concepts are not known. To do this, candidate C concepts will be discovered using natural language processing (NLP) named-entity (NE) extraction techniques to identify the kinds of concepts we are interested in. For example, when using medical domains, drugs, diseases, symptoms or side effects will be tagged by NLP NE extraction and then used as candidate C

concepts. When studying, for example, the Enron corpus, people, places, organizations or events will be tagged.

Acknowledgments

Work described in this paper was funded by Intelligent Software Solutions as an internal research and development project (IRAD). The system described in the paper is patent pending, all rights reserved.

References

[1] Enron data set. enron/.
[2] Medline/pubmed data files. or
[3] Mysql enron database. adibi/enron/enron.htm.
[4] Plugin for visualizing neo4j graphs in gephi,
[5] Gephi graph visualization tool,
[6] Lucene,
[7] Neo4j - an open source graph databases. website,
[8] Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2): ,
[9] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41: ,
[10] Murat C. Ganiz, William M. Pottenger, and Christopher D. Janneck. Recent advances in literature based discovery. Technical Report LU- CSE , Lehigh University,
[11] Michael D. Gordon and Susan Dumais. Using latent semantic indexing for literature based discovery. J. Am. Soc. Inf. Sci., 49: , June
[12] Pentti Kanerva. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1: , /s
[13] Joel A Solka Jeffrey L Briggs Michael B Rushenberg Robert L Stump Jesse A Johnson Dustin Lyons Terence J Wyatt Jeffrey R Kostoff, Ronald N Block. Literature-related discovery (lrd). Technical Report ADA473438, OFFICE OF NAVAL RESEARCH, November
[14] Robert K. Lindsay and Michael D. Gordon. Literature-based discovery by lexical statistics. J. Am. Soc. Inf. Sci., 50: , May
[15] Aditya Kumar Sehgal. Profiling topics on the Web for knowledge discovery. PhD diss, University of Iowa,
[16] D. R. Swanson. Fish oil, raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1):7-18, 1986 Autumn.
[17] Marc Weeber, Henny Klein, Lolkje T.W. de Jong-van den Berg, and Rein Vos. Using concepts in literature-based discovery: Simulating swanson's raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science and Technology, 52(7): ,
[18] Dominic Widdows et al. Semantic vectors package,
[19] M. Yetisgen-Yildiz and W. Pratt. Using statistical and knowledge based approaches for literature-based discovery. Journal of Biomedical Informatics, 39(6): ,
[20] Meliha Yetisgen-Yildiz and Wanda Pratt. A new evaluation methodology for literature-based discovery systems. Journal of Biomedical Informatics, 42(4): , 2009.
