Preliminary Experiments on Literature Based Discovery using the Semantic Vectors Package


M. Heidi McClure
Intelligent Software Solutions, Inc
Colorado Springs, CO

Abstract

This paper presents a literature based discovery (LBD) implementation that uses Lucene for indexing, the Semantic Vectors (SV) package for latent semantic analysis, Neo4j for graph database storage, and Gephi for visual representation, along with custom code written by the author. Using a latent semantic analysis based system like SV to do LBD is not new, but taking the next steps of examining related concepts and using a graph database representation to find candidate linking terms is. The LBD system is a framework in which relation extraction experiments may be performed. This paper presents work that is in progress.

Keywords: literature based discovery, semantic vectors package, relation extraction

1. Introduction

Literature based discovery (LBD) has been around since the 1980s, when D. R. Swanson first defined LBD as a means to discover previously unknown knowledge by examining term occurrences across multiple documents [16]. This paper presents an implementation that uses variations on latent semantic analysis (LSA) to perform LBD-style discovery. Studies of using LSA to do LBD have been documented by various authors, among them [14], [11], [19]. LSA discovers semantically related concepts, giving stronger correlation scores to more strongly related concepts. When a pair of concepts has a significant semantic relatedness score from LSA and the two concepts are never mentioned in the same document, the pair is an LBD candidate pair. The other candidate pairs, those whose concepts are mentioned together in one or more documents, may also provide important information. These two sets, LBD candidate pairs and share-document-mention (SDM) pairs, present data that allows a next level of discovery: the discovery of candidate linking terms.
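The split between LBD candidate pairs and SDM pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's code: the function name `classify_pairs`, the tuple layout, the document ids and the scores are all hypothetical, and a plain dictionary of document-id sets stands in for the Lucene index lookups.

```python
def classify_pairs(related_pairs, docs_by_term):
    """Split related concept pairs into LBD candidate pairs and SDM pairs.

    related_pairs: iterable of (term_a, term_c, score) tuples (hypothetical layout).
    docs_by_term:  dict mapping a term to the set of document ids that mention it
                   (stands in for a query against the document index).
    """
    lbd_pairs, sdm_pairs = [], []
    for term_a, term_c, score in related_pairs:
        shared = docs_by_term.get(term_a, set()) & docs_by_term.get(term_c, set())
        if shared:
            # Both concepts are mentioned together in at least one document.
            sdm_pairs.append((term_a, term_c, score, len(shared)))
        else:
            # Related by score, but never mentioned in the same document.
            lbd_pairs.append((term_a, term_c, score, 0))
    return lbd_pairs, sdm_pairs


# Made-up documents and scores, for illustration only.
docs_by_term = {
    "raynaud": {"d1", "d2"},
    "fishoil": {"d3"},
    "prostacyclin": {"d2", "d3"},
}
related_pairs = [("raynaud", "fishoil", 0.41), ("raynaud", "prostacyclin", 0.63)]
lbd_pairs, sdm_pairs = classify_pairs(related_pairs, docs_by_term)
```

In this toy data, `raynaud` and `fishoil` share no document and land in the LBD list, while `raynaud` and `prostacyclin` share one document and land in the SDM list.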
The process of using graph database navigation across all related concepts found with LSA to discover new linking-concept candidates for an LBD candidate pair is the new approach presented in this paper. The system presented in this paper handles these tasks: corpus generation, candidate related entities discovery and relationship analysis. Corpus generation retrieves document data from databases or from text files (including XML). All data is pre-processed by applying various normalization tasks. The pre-processed data is placed on the filesystem, with each document in a separate text file. The candidate related entities discovery task indexes the data, performs forms of LSA and identifies the concept pairs with the highest relatedness scores. Relationship analysis is the final task of identifying the candidate linking concepts. This is done by first retrieving all documents relating to a candidate pair and determining which pairs are LBD candidate pairs and which are SDM pairs. Then each concept-to-concept link, labeled as an LBD or an SDM pair, is captured in a graph database. Last, candidate linking terms are discovered by traversing the graph database. This work is in progress and will provide the framework on which more advanced relationship discovery will be performed.

The rest of this paper presents some background information on LBD, LSA, and Random Indexing and Reflective Random Indexing (RI and RRI), which are enhancements to LSA. It presents some details of the software components and data used in the design and test of the system, including the Semantic Vectors (SV) package, Lucene, Neo4j, Gephi and the MEDLINE data set. It presents the design of the system and a discussion of the results. The end of this paper presents some conclusions and ideas for the future directions of this work.

2. Background

2.1 Literature Based Discovery

LBD is the discovery of hidden knowledge in large sets of documents where the discoveries relate concepts A and C together.
In LBD, a single document in the corpus will not contain the discovery. Sometimes a linking term, B, may be the means by which A and C are discovered and B would be in all documents containing A or C. In statistical approaches to LBD, there may not be a linking B term in the discovery - instead, A and C are discovered by semantic relatedness of the documents using, for example, latent semantic analysis (LSA) techniques. Once candidate discoveries are found, experiments may be performed to prove or disprove the hypotheses.

As various works state, Don R. Swanson [16] is considered to be the first to have mentioned the LBD style of discovery [13], [10], [15]. In one of his studies, Swanson manually reviewed medical journals and found mentions of Raynaud's syndrome and its relation to various blood issues such as problems with blood viscosity, platelet function and vascular reactivity. Then he looked for discussions of these blood issues in articles that did not mention Raynaud's and found associations between the blood problems and fish oil. The connection made is that, perhaps, Raynaud's syndrome could be treated with fish oil. Medical studies have since been able to validate this discovery.

Since his initial study, Swanson and many other researchers have applied automation to the task of LBD. Initial works looked for candidate B linking terms simply by their co-occurrence in the A-term documents, then used these B terms to find candidate documents where A is not mentioned. They then tried to find, again by co-occurrence, candidate C terms in the new set of documents. They sometimes used lists of words (vocabularies) to restrict which words they considered in each search - when a list is used for both the A and C concepts, the LBD system is called a closed system. An open system uses a list of words only for the starting concept, A [10]. This initial work is a closed LBD system, but future work plans to use the open approach.

2.2 Latent Semantic Analysis

LSA came initially from the Latent Semantic Indexing (LSI) work done by Scott Deerwester, Susan Dumais and others [9] and was used in LBD by Michael Gordon and Dumais [11]. LSA does not necessarily require a vocabulary but, instead, finds similar documents based on LSI. LSA assumes that if terms or concepts are found in similar sets of text (not always the same text), then these terms or concepts may be related, the same or similar. The mathematics behind LSI uses singular value decomposition (SVD) to reduce the dimensions of extremely large matrices by getting rid of less interesting data and to discover the related terms in documents. LSI proved to be more efficient than previous methods and has been moderately successful; however, it is still slow and computationally expensive.
2.3 Random and Reflective Random Indexing

More recently, Cohen, et al. [8] experimented with random indexing (RI) - a more scalable version of LSI - and extended the RI concepts to support indirect inference. Indirect inference is what Cohen, et al. sometimes call LBD. RI uses a random approach to further reduce the size of the matrices being analyzed to discover similar terms in documents. Instead of a full term by document matrix, documents are placed into small sets of columns. For example, if there are 10,000 documents, a document may be assigned to 20 randomly chosen columns. Each document's term frequency information is tallied in each of its columns, along with that of any other document randomly assigned to the same column [12]. Cohen, et al. also experimented with variations of RI - sliding windows on RI, term based Reflective Random Indexing (RRI), and document based RRI. RRI uses RI iteratively, feeding the results from one RI pass into another pass. Term and document based RRI vary how the random indexing is chosen - by term or by document in the various passes through the RRI. Their claim is that these techniques provide more related terms/concepts that may not co-occur in the same document but are possibly related. They state that their use of RRI techniques is better suited for LBD.

Fig. 1: Architecture

3. Design

This section describes the general design of the SV-based LBD system. Multiple steps are performed that ultimately present pairs of entities that may be related, along with candidate linking terms. The general architecture of the system is shown in Figure 1. The steps noted below align with the small boxes in the architecture diagram.

1) Retrieve data and place each document into a separate text file on the computer filesystem. The data tested so far has been in databases as reports summarizing, for example, news articles, as web pages or as other text data.
Data has also been in databases as copies of emails and in large XML files which needed to be broken up to get text documents. An example of XML files is the MEDLINE data presented in this paper, where the resulting document pulled from the XML is just the abstract of the article.

2) Pre-process the corpus, mostly to identify the concepts we wish to analyze. This makes a copy of the original filesystem documents and tags the new documents as necessary. Today this consists of doing some simple normalization of the data, turning John Doe into johnxxxdoe so that the next step of indexing doesn't need to have custom indexing capabilities implemented. In other words, multi-word tokens are turned into single words. Lowercase is used because the SV package works best with all concepts in lower case.

3) Index the corpus using Lucene indexing as described in the semantic vectors package. The SV package uses the Lucene index to help build the SV vector files. See [6] for more information on Lucene.

4) Apply semantic vectors processing to create document and term vector files. Term vectors are not simple word count per document vectors. They are vectors where the entries represent a term's relatedness to random sets of documents. Comparing the vectors for two different terms (i.e. concepts) produces a score that gives the LSA relatedness of the two terms.

5) Candidate related entity identification is performed by comparing the term vectors for each entity as discussed in the previous step. Some term pairs come up with a 1.0 score, indicating the strongest relatedness - since there is probably nothing new to discover with such a pair, these terms are ignored in LBD processing. Similarly, terms with zero or negative scores are also ignored.

6) Retrieve documents for the related entities. Finding these documents is done by going back to the original Lucene index created before SV processing. Returned are documents that mention either of the entities or both. At this time, identification of LBD candidates is done by finding the pairs of related entities that are never mentioned in the same documents.

7) Relationship identification involves a) examining documents where entities appear together and b) when entities are LBD candidates, identifying candidate linking B terms. This may be done by navigating the graph looking for terms that are linked to both the A and the C terms. There are examples of this presented in the results section. Explaining why entities may be related is the primary area of future work planned by the author.

In the next subsections, a small amount of detail is provided about some of the tools and APIs used in this work.

3.1 Semantic Vectors Package

A byproduct of Cohen's work to improve LSI [8] is the Semantic Vectors (SV) package. This open source Java software was initially written by Cohen's co-author, Dominic Widdows, and is now maintained by him and many others as a Google Code project [18]. SV provides a library of capabilities that perform random indexing, which runs much faster than SVD.
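To make the random indexing idea concrete, here is a minimal pure-Python sketch that follows only the simplified description given in Section 2.3 (each document is tallied into a few randomly chosen columns) and then compares term vectors by cosine. It is not the SV package's implementation, which is more elaborate; the names `random_index` and `cosine`, the parameters `dim`, `cols_per_doc` and `seed`, and the toy documents are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def random_index(docs, dim=64, cols_per_doc=4, seed=0):
    """Build reduced term vectors by tallying each document's term
    frequencies into a small random set of columns (simplified RI)."""
    rng = random.Random(seed)
    term_vectors = {}
    for text in docs:
        # Each document is assigned to a few distinct, randomly chosen columns.
        columns = rng.sample(range(dim), cols_per_doc)
        for term, count in Counter(text.split()).items():
            vec = term_vectors.setdefault(term, [0.0] * dim)
            for col in columns:
                vec[col] += count
    return term_vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy corpus using the single-token, lowercase form from step 2.
docs = [
    "raynaud blood viscosity",
    "fishoil blood viscosity",
    "enron email meeting",
]
vectors = random_index(docs)
always_together = cosine(vectors["blood"], vectors["viscosity"])
```

Terms that always occur in the same documents, such as `blood` and `viscosity` here, get identical vectors and a score of about 1.0, which matches the kind of pair step 5 discards. In this single-pass sketch, terms that never share a document overlap only by chance, which is the gap the reflective (RRI) passes described in Section 2.3 are meant to address.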
SVD is an N x N problem where matrices grow to a size that current computing capabilities will not allow to be computed. RI can do LSA-like analysis on millions of documents. For this work, terms are compared to identify relatedness.

3.2 Lucene indexing

Lucene [6] is a set of Java libraries that, among other things, allows for the searching of terms and phrases in sets of text documents or other representations of text like PDFs, HTML, Microsoft Word, etc. Lucene creates index files that contain the information necessary not only to quickly find terms or phrases that may be contained in a corpus, but also to indicate where in the document the terms or phrases are. Lucene provides fast and efficient search capabilities. In the LBD solution presented here, Lucene indexes are created first and then the Semantic Vectors package is used to find candidate LBD pairs. Once pairs are found, the Lucene indexes are again referenced to find the documents in which entities are mentioned.

3.3 Graph database - Neo4j

Running SV with term comparisons across a set of documents discovers pairs of concepts that may be related. A graph database that represents node X, node Y and the link joining nodes X and Y is a good choice for storing results of SV or any other results of latent semantic analysis. Graph databases are one kind of NoSQL database, since data is not stored in a traditional table structure. Neo4j is an open source, lightweight graph database [7]. For this project, nodes are the concepts (for example, A, B and C concepts) and the links are either an LBD link, where the nodes on either side of the link are never mentioned in the same document, or a shared document link, where the nodes on either side are both mentioned in one or more documents.

3.4 Visualization - Gephi

Once data is stored in a Neo4j database, visualizing that data is important in order to assist in the analysis of results. Gephi is an open source graph visualization tool [5].
Martin Skurla created a Neo4j plugin for Gephi during Google's Summer of Code [4]. Once a Neo4j database is loaded into Gephi, you can show only LBD relationships (links) or only shared-document relationships. The graph (link) pictures shown in this paper were created from snapshots of Gephi displays of Neo4j databases.

4. Experimentation and Results

4.1 Data Sets

The current application allows flexibility in what steps of the LBD process are run and in what data are used. For example, the MEDLINE data used in the experiments presented here are pulled out of XML files using Python scripts that place each abstract into a separate file on the filesystem. Data may also come from, for example, an Enron corpus that may be in a database or may just be files on a filesystem [1], [3]. When data comes from database queries, the document data usually is read from large text fields or sets of fields - the system presented here places this data onto the filesystem, one file per document.

MEDLINE is a National Library of Medicine database of biomedical literature that includes, among other things, the titles and abstracts of over 21 million articles from over 5000 different journals and publications [2]. This data is stored in many XML files. For this effort, the Python script pulled each abstract for the years of interest and placed each into a file on the filesystem. Initial experimentation used all the abstracts from a date range that simulates the rough dates used by Don Swanson in his initial LBD work [16]. This date range created 692,382 total file documents.

Fig. 2: Sample of related pairs
Fig. 3: Sample LBD Pairs
Fig. 4: Sample Candidate B Terms

4.2 Output

MEDLINE data was analyzed by the system using approximately 190 candidate A and C concepts that were gathered from two papers - one that discussed fish oil and Raynaud's [11] and the other that discussed both the fish oil - Raynaud's connection and the migraine - magnesium connection [17]. During the processing of results, a Neo4j database was created that ultimately allows for visual display of results. The Neo4j database contains nodes that represent concepts and links that are named either an LBD relationship or a share-document-mention (SDM) relationship. If the relationship between two nodes is an LBD relationship, the nodes were never mentioned together in the same document. The system also creates summary results in a set of text output files. In order to keep the number of result pairs to consider at a manageable size, an algorithm is currently used that returns either a fixed number of results or a percentage of the total pairs discovered. Some of the generated output files contain:

1) All related pairs - pairs that share mentions in documents and those that don't. The ones that don't will be LBD pairs. The SV score is also captured for each pair. See Figure 2.

2) All LBD pairs - in order to discover the LBD pairs, the Lucene index is queried to find all documents in which the terms are mentioned. If there are no documents in common (bothaandcdocs = 0 in the LBD Pairs output), then this is an LBD pair. For preliminary work, only a maximum of 100 document matches is allowed for this result reporting. See Figure 3.

3) All candidate linking terms - this file is the result of navigating the Neo4j graph database to find candidate B linking terms between the LBD pairs reported in the second list (All LBD Pairs). The first two terms are the LBD pair, the last is the candidate B term. See Figure 4.
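The traversal behind the third output file can be sketched without a graph database. The Python below swaps Neo4j for a plain in-memory adjacency map and assumes, based on how Figures 6 and 7 are described, that a candidate B term is one connected to both A and C through SDM links. The function name `candidate_linking_terms` and the hard-coded pairs are illustrative only; the pairs echo concepts named in this paper but are not the system's actual output.

```python
from collections import defaultdict


def candidate_linking_terms(lbd_pairs, sdm_pairs):
    """For each LBD pair (A, C), return (A, C, B) rows where B shares
    documents with both A and C (assumed reading of the traversal)."""
    neighbors = defaultdict(set)
    for x, y in sdm_pairs:
        # SDM links are undirected: record both directions.
        neighbors[x].add(y)
        neighbors[y].add(x)
    rows = []
    for a, c in lbd_pairs:
        shared = neighbors.get(a, set()) & neighbors.get(c, set())
        for b in sorted(shared):
            rows.append((a, c, b))
    return rows


# Illustrative pairs only, not real system output.
lbd_pairs = [
    ("raynaud", "lupus"),
    ("raynaud", "selective beta blockers"),
    ("raynaud", "fish oil"),
]
sdm_pairs = [
    ("lupus", "prostacyclin"),
    ("raynaud", "prostacyclin"),
    ("selective beta blockers", "adrenergic alpha receptor"),
    ("raynaud", "adrenergic alpha receptor"),
]
rows = candidate_linking_terms(lbd_pairs, sdm_pairs)
```

Each returned row has the same shape as the candidate linking terms file: the LBD pair first, then the candidate B term. An LBD pair with no shared SDM neighbor, such as `raynaud` and `fish oil` in this toy data, simply produces no rows.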

Fig. 5: Highlight Raynaud's and see LBD candidate pairs
Fig. 6: Terms that share document references - A and C concepts, Lupus and Raynaud's, linked by B concept, Prostacyclin

4.3 Analysis

In addition to the text file results, the Neo4j database loaded into Gephi, the graph visualization tool, helps in visualizing what is in the data. In Figure 5, all of the medical concepts that had an LBD relationship with Raynaud's phenomenon are highlighted. Simply mousing over the Raynaud's node performs this highlighting. From this visualization, we see that Raynaud's phenomenon is related, via LBD, to Lupus, selective beta blockers and fish oil - again, it is LBD because the links shown are only the LBD candidate pairs. As in Swanson's work [16], fish oil and Raynaud's phenomenon are never mentioned in the same documents. In Figure 6, only links where the entities actually are mentioned in the same documents are shown. This particular view, where prostacyclin is highlighted, shows that prostacyclin may be a linking B concept between the A and C concepts - Lupus and Raynaud's phenomenon. In Figure 7, we can see that adrenergic alpha receptor is a concept relating selective beta blockers and Raynaud's phenomenon together. In this same diagram, we notice that fish oil had no candidate linking terms to Raynaud's - this is probably due to the small subset of A and C concepts analyzed in this early testing (as noted earlier, approximately 190 terms were used for candidate A and C concepts).

Fig. 7: Terms that share document references - A and C concepts, selective beta blockers and Raynaud's, linked by B concept, adrenergic alpha receptor. Note that no candidate term was found between Raynaud's and fish oil.

5. Conclusion

I have presented an approach to discovering hidden knowledge in documents using a latent semantic analysis variant from the semantic vectors package. My approach discovers candidate A and C concepts or terms which, although never mentioned in the same document, may be related. I have also discovered candidate linking or B terms that relate the A and C. To the best of my knowledge and research, identifying the linking B terms from the LSA results is something that has not previously been done. This system provides the platform on which alternative approaches may be tried to improve the quality of the discovered pairs.

6. Future

I hope to complete the following major tasks.

6.1 Complete the Reproduction of Swanson Results

In order to test the results of this system, I have successfully shown that fish oil and Raynaud's are semantically related and are never mentioned in the same documents in the corpus. Using the evaluation methodology described by Meliha Yetisgen-Yildiz and Wanda Pratt [20], the next step is to examine the corpus of documents after 1984 to see if the LBD candidates are mentioned in the same documents, thus, perhaps, confirming that there is now an established and important relationship between the LBD candidate pairs. There are other published results based on MEDLINE that use LBD to discover previously unknown related concepts. I may try to recreate those discoveries, also.

6.2 Making Operational Solution

I plan to examine how analysts, scientists or other users might use the capabilities of the system presented in this paper. If I can refine the system so that fewer false positives are presented, users may find great benefit in the kind of discoveries that LBD presents - that is, the previously unknown and hidden knowledge that is in existing data.
6.3 Relationship Extraction - Discovering the Why

I plan to take the results - the candidate LBD pairs and perhaps the linking B terms - and try to figure out why the concepts are related. That is, try to explain the relations. Initially, I plan to try relationship extraction approaches that have already been published, with the hope of refining these techniques to fit this data and to provide more accurate results.

6.4 Open LBD

The approach I have taken to date is a closed LBD solution where we know what A and C concepts to study. The next step is to expand this work to perform open LBD, where the C concepts are not known. To do this, candidate C concepts are discovered using natural language processing (NLP) named-entity (NE) extraction techniques to identify the kinds of concepts we are interested in. For example, when using medical domains, drugs, diseases, symptoms or side effects will be tagged by NLP NE extraction and then used as candidate C concepts. When studying, for example, the Enron corpus, people, places, organizations or events will be tagged.

Acknowledgments

Work described in this paper was funded by Intelligent Software Solutions as an internal research and development project (IRAD). The system described in the paper is patent pending, all rights reserved.

References

[1] Enron data set. enron/.
[2] Medline/pubmed data files.
[3] Mysql enron database. adibi/enron/enron.htm.
[4] Plugin for visualizing neo4j graphs in gephi.
[5] Gephi graph visualization tool.
[6] Lucene.
[7] Neo4j - an open source graph database. Website.
[8] Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2).
[9] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41.
[10] Murat C. Ganiz, William M. Pottenger, and Christopher D. Janneck. Recent advances in literature based discovery. Technical Report LU-CSE, Lehigh University.
[11] Michael D. Gordon and Susan Dumais. Using latent semantic indexing for literature based discovery. J. Am. Soc. Inf. Sci., 49, June.
[12] Pentti Kanerva. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1.
[13] Ronald N. Kostoff, Joel A. Block, Jeffrey L. Solka, Michael B. Briggs, Robert L. Rushenberg, Jesse A. Stump, Dustin Johnson, Terence J. Lyons, and Jeffrey R. Wyatt. Literature-related discovery (LRD). Technical Report ADA473438, Office of Naval Research, November.
[14] Robert K. Lindsay and Michael D. Gordon. Literature-based discovery by lexical statistics. J. Am. Soc. Inf. Sci., 50, May.
[15] Aditya Kumar Sehgal. Profiling topics on the Web for knowledge discovery. PhD diss., University of Iowa.
[16] D. R. Swanson. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1):7-18, Autumn 1986.
[17] Marc Weeber, Henny Klein, Lolkje T.W. de Jong-van den Berg, and Rein Vos. Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science and Technology, 52(7).
[18] Dominic Widdows et al. Semantic vectors package.
[19] M. Yetisgen-Yildiz and W. Pratt. Using statistical and knowledge based approaches for literature-based discovery. Journal of Biomedical Informatics, 39(6).
[20] Meliha Yetisgen-Yildiz and Wanda Pratt. A new evaluation methodology for literature-based discovery systems. Journal of Biomedical Informatics, 42(4), 2009.


More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Genescene: Biomedical Text and Data Mining

Genescene: Biomedical Text and Data Mining Claremont Colleges Scholarship @ Claremont CGU Faculty Publications and Research CGU Faculty Scholarship 5-1-2003 Genescene: Biomedical Text and Data Mining Gondy Leroy Claremont Graduate University Hsinchun

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Historical Text Mining:

Historical Text Mining: Historical Text Mining Historical Text Mining, and Historical Text Mining: Challenges and Opportunities Dr. Robert Sanderson Dept. of Computer Science University of Liverpool azaroth@liv.ac.uk http://www.csc.liv.ac.uk/~azaroth/

More information

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM Ms.Susan Geethu.D.K 1, Ms. R.Subha 2, Dr.S.Palaniswami 3 1, 2 Assistant Professor 1,2 Department of Computer Science and Engineering, Sri Krishna

More information

PubMed Assistant: A Biologist-Friendly Interface for Enhanced PubMed Search

PubMed Assistant: A Biologist-Friendly Interface for Enhanced PubMed Search Bioinformatics (2006), accepted. PubMed Assistant: A Biologist-Friendly Interface for Enhanced PubMed Search Jing Ding Department of Electrical and Computer Engineering, Iowa State University, Ames, IA

More information

Automated Classification. Lars Marius Garshol Topic Maps

Automated Classification. Lars Marius Garshol Topic Maps Automated Classification Lars Marius Garshol Topic Maps 2007 2007-03-21 Automated classification What is it? Why do it? 2 What is automated classification? Create parts of a topic map

More information

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007 SEMANIC HASHING Ruslan Salakhutdinov and Geoffrey Hinton University of oronto, Machine Learning Group IRGM orkshop July 2007 Existing Methods One of the most popular and widely used in practice algorithms

More information

Quick Reference Guide

Quick Reference Guide Quick Reference Guide www.scopus.com Scopus is the largest abstract and citation database of peer-reviewed literature with bibliometrics tools to track, analyze and visualize research. It contains over,000

More information

Medline. Library Services

Medline. Library Services Library Services Medline Medline (produced by the U.S. National Library of Medicine) is widely recognised as the premier source of bibliographic information for health and biomedical literature. It covers

More information

American Institute of Physics

American Institute of Physics American Institute of Physics (http://journals.aip.org/)* Founded in 1931, the American Institute of Physics (AIP) is a not-for-profit scholarly society established for the purpose of promoting the advancement

More information

Bibliometrics: Citation Analysis

Bibliometrics: Citation Analysis Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Now, if you consider citations as links, academic

More information

The Constellation Project. Andrew W. Nash 14 November 2016

The Constellation Project. Andrew W. Nash 14 November 2016 The Constellation Project Andrew W. Nash 14 November 2016 The Constellation Project: Representing a High Performance File System as a Graph for Analysis The Titan supercomputer utilizes high performance

More information

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E Powering Knowledge Discovery Insights from big data with Linguamatics I2E Gain actionable insights from unstructured data The world now generates an overwhelming amount of data, most of it written in natural

More information

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Yikun Guo, Henk Harkema, Rob Gaizauskas University of Sheffield, UK {guo, harkema, gaizauskas}@dcs.shef.ac.uk

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS SCOPUS ORIGINAL RESEARCH INFORMATION IN SCIENCE is published (stored) in PRIMARY LITERATURE it refers to the first place a scientist will communicate to the general audience in a publicly accessible document

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the relationships between concepts. And we discussed common

More information

New Approach to Graph Databases

New Approach to Graph Databases Paper PP05 New Approach to Graph Databases Anna Berg, Capish, Malmö, Sweden Henrik Drews, Capish, Malmö, Sweden Catharina Dahlbo, Capish, Malmö, Sweden ABSTRACT Graph databases have, during the past few

More information

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

Scopus. Quick Reference Guide

Scopus. Quick Reference Guide Scopus Quick Reference Guide Quick Reference Guide An eye on global research. Scopus is the largest abstract and citation database of peer-reviewed literature, with bibliometrics tools to track, analyze

More information

The Curated Web: A Recommendation Challenge. Saaya, Zurina; Rafter, Rachael; Schaal, Markus; Smyth, Barry. RecSys 13, Hong Kong, China

The Curated Web: A Recommendation Challenge. Saaya, Zurina; Rafter, Rachael; Schaal, Markus; Smyth, Barry. RecSys 13, Hong Kong, China Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title The Curated Web: A Recommendation Challenge

More information

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points? Ranked Retrieval One option is to average the precision scores at discrete Precision 100% 0% More junk 100% Everything points on the ROC curve But which points? Recall We want to evaluate the system, not

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

An UIMA based Tool Suite for Semantic Text Processing

An UIMA based Tool Suite for Semantic Text Processing An UIMA based Tool Suite for Semantic Text Processing Katrin Tomanek, Ekaterina Buyko, Udo Hahn Jena University Language & Information Engineering Lab StemNet Knowledge Management for Immunology in life

More information

Essential Dimensions of Latent Semantic Indexing (LSI)

Essential Dimensions of Latent Semantic Indexing (LSI) Essential Dimensions of Latent Semantic Indexing (LSI) April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville, PA 19426 Email: akontostathis@ursinus.edu Abstract

More information

BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data

BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data María-Esther Vidal 1, Louiqa Raschid 2, Natalia Márquez 1, Jean Carlo Rivera 1, and Edna Ruckhaus 1 1 Universidad

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

How SPICE Language Modeling Works

How SPICE Language Modeling Works How SPICE Language Modeling Works Abstract Enhancement of the Language Model is a first step towards enhancing the performance of an Automatic Speech Recognition system. This report describes an integrated

More information

Contextual Search using Cognitive Discovery Capabilities

Contextual Search using Cognitive Discovery Capabilities Contextual Search using Cognitive Discovery Capabilities In this exercise, you will work with a sample application that uses the Watson Discovery service API s for cognitive search use cases. Discovery

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning Yasushi Kiyoki, Takashi Kitagawa and Takanari Hayama Institute of Information Sciences and Electronics University of Tsukuba

More information

Searching the Evidence in PubMed

Searching the Evidence in PubMed CAMBRIDGE UNIVERSITY LIBRARY MEDICAL LIBRARY Supporting Literature Searching Searching the Evidence in PubMed July 2017 Supporting Literature Searching Searching the Evidence in PubMed How to access PubMed

More information

SciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus

SciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus Prepared by: Jawad Sayadi Account Manager, United Kingdom Elsevier BV Radarweg 29 1043 NX Amsterdam The Netherlands J.Sayadi@elsevier.com SciVerse Scopus SciVerse Scopus 1. Scopus introduction and content

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Clinical Database applications in hospital

Clinical Database applications in hospital Clinical Database applications in hospital Mo Sun, Ye Lin, Roger Yim Lee sun2m, lin2y, lee1ry@cmich.edu Department of Computer Science Central Michigan University Abstract Database applications are used

More information

Clustered SVD strategies in latent semantic indexing q

Clustered SVD strategies in latent semantic indexing q Information Processing and Management 41 (5) 151 163 www.elsevier.com/locate/infoproman Clustered SVD strategies in latent semantic indexing q Jing Gao, Jun Zhang * Laboratory for High Performance Scientific

More information

Database Management System Prof. Partha Pratim Das Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

Database Management System Prof. Partha Pratim Das Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Database Management System Prof. Partha Pratim Das Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture - 19 Relational Database Design (Contd.) Welcome to module

More information

Geosemantically-enhanced PubMed Queries Using the Geonames Ontology and Web Services

Geosemantically-enhanced PubMed Queries Using the Geonames Ontology and Web Services Geosemantically-enhanced PubMed Queries Using the Geonames Ontology and Web Services Maged N. Kamel Boulos, PhD, MSc, MBBCh Plymouth University, UK mnkboulos@ieee.org Agenda About PubMed and MeSH The Problem

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Renae Barger, Executive Director NN/LM Middle Atlantic Region

Renae Barger, Executive Director NN/LM Middle Atlantic Region Renae Barger, Executive Director NN/LM Middle Atlantic Region rbarger@pitt.edu http://nnlm.gov/mar/ DANJ Meeting, November 4, 2011 Advanced PubMed (20 min) General Information PubMed Citation Types Automatic

More information

Searching the Evidence using EBSCOHost

Searching the Evidence using EBSCOHost CAMBRIDGE UNIVERSITY LIBRARY MEDICAL LIBRARY Supporting Literature Searching Searching the Evidence using EBSCOHost ATHENS CINAHL Use to search CINAHL with an NHS ATHENS login (or PsycINFO with University

More information

Efficient Mining Algorithms for Large-scale Graphs

Efficient Mining Algorithms for Large-scale Graphs Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed

More information

A MODULAR APPROACH TO DOCUMENT INDEXING AND SEMANTIC SEARCH

A MODULAR APPROACH TO DOCUMENT INDEXING AND SEMANTIC SEARCH A MODULAR APPROACH TO DOCUMENT INDEXING AND SEMANTIC SEARCH Dhanya Ravishankar, Krishnaprasad Thirunarayan, and Trivikram Immaneni Department of Computer Science and Engineering Wright State University,

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics An Oracle White Paper October 2012 Oracle Social Cloud Platform Text Analytics Executive Overview Oracle s social cloud text analytics platform is able to process unstructured text-based conversations

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Pavel P. Kuksa, Rutgers University Yanjun Qi, Bing Bai, Ronan Collobert, NEC Labs Jason Weston, Google Research NY Vladimir

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

How to Work with a Reference Answer Set

How to Work with a Reference Answer Set How to Work with a Reference Answer Set Easily identify and isolate references of interest Quickly retrieve relevant information from the world s largest, publicly available reference database for chemistry

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,

More information

Case Study on Testing of Web-Based Application: Del s Students Information System

Case Study on Testing of Web-Based Application: Del s Students Information System Case Study on Testing of Web-Based Application: Del s Students Information System Arnaldo Marulitua Sinaga Del Institute of Technology, North Sumatera, Indonesia. aldo@del.ac.id Abstract Software Testing

More information

The Semantic Conference Organizer

The Semantic Conference Organizer 34 The Semantic Conference Organizer Kevin Heinrich, Michael W. Berry, Jack J. Dongarra, Sathish Vadhiyar University of Tennessee, Knoxville, USA CONTENTS 34.1 Background... 571 34.2 Latent Semantic Indexing...

More information

A Content Based Image Retrieval System Based on Color Features

A Content Based Image Retrieval System Based on Color Features A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing

Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing R. B. Bradford Agilex Technologies Inc, Chantilly, Virginia r.bradford@agilex.com Abstract Access to specific information

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Retrieval of Highly Related Documents Containing Gene-Disease Association

Retrieval of Highly Related Documents Containing Gene-Disease Association Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Extractive Text Summarization Techniques

Extractive Text Summarization Techniques Extractive Text Summarization Techniques Tobias Elßner Hauptseminar NLP Tools 06.02.2018 Tobias Elßner Extractive Text Summarization Overview Rough classification (Gupta and Lehal (2010)): Supervised vs.

More information

LSA-like models Bill Freeman June 21, 2004

LSA-like models Bill Freeman June 21, 2004 LSA-like models Bill Freeman June 21, 2004 1 Introduction LSA is simple and stupid, but works pretty well for analyzing the meanings of words and text, given a large, unlabelled training set. Why? LSA

More information

e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3

e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3 e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3 1 National Institute of Pharmaceutical Education and

More information

Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF

Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF Julia Jürgens, Sebastian Kastner, Christa Womser-Hacker, and Thomas Mandl University of Hildesheim,

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information