Semantic Text Mining and its Application in Biomedical Domain. A Thesis. Submitted to the Faculty. Drexel University. Illhoi Yoo


Semantic Text Mining and its Application in Biomedical Domain

A Thesis Submitted to the Faculty of Drexel University by Illhoi Yoo in partial fulfillment of the requirements for the degree of Doctor of Philosophy

June 2006

Copyright 2006 Illhoi Yoo. All Rights Reserved.


ACKNOWLEDGEMENTS

I am indebted to many people for their support and advice toward the successful completion of my Ph.D. degree and this dissertation. My deepest gratitude goes to my supervisor, Dr. Xiaohua Hu, for his guidance and assistance with this dissertation as well as all the research during my doctoral endeavor over the past four years. He has helped me to pursue my investigations in depth and to remain focused on achieving my goal. I am grateful to my committee members, Dr. Il-Yeol Song, Dr. Xia Lin, Dr. Bahrad A. Sokhansanj, and Dr. Don Goelman, for their invaluable advice and suggestions. In particular, Dr. Song has always been meticulous in proofreading my research papers; his advice on both academic and non-academic matters has been inestimable. I would like to express my appreciation to my parents, SungTae Yoo and SunJa Park, and to my parents-in-law, TaeWhan Jung and SoonAe Goo, for their love, support, and encouragement. I would like to express my sincere thanks to my wife, YoungJae Jung, for her love and sacrifice; without her constant sacrifice, this thesis would not have been possible. I dedicate this thesis to my two little sons, William and Jason, with love. Finally, the research relevant to this thesis has been supported in part by the NSF Career grant (NSF IIS ), NSF CCF, and the PA Dept of Health Tobacco Settlement Formula Grant (#240205, ).

TABLE OF CONTENTS

LIST OF TABLES .......... vi
LIST OF FIGURES .......... ix
ABSTRACT .......... xi

CHAPTER 1: INTRODUCTION
    Research Questions
    Contributions of the Thesis
    Organization of Thesis .......... 9

CHAPTER 2: RELATED WORK
    Ontologies
        Medical Subject Headings (MeSH)
        Unified Medical Language System (UMLS)
        Gene Ontology (GO)
    Vector Space Representation of Documents
    Document Clustering
    Text Summarization
    Swanson's ABC model .......... 21

CHAPTER 3: SEMANTIC TEXT MINING AND ITS APPLICATION IN BIOMEDICAL DOMAIN
    Graphical Representations of Documents
    Document Clustering using Scale-free Graphical Representation
        Integration of Individual Graphs into Corpus-level Graphical Representation .......... 31

        Graph Clustering for a Graphical Representation of Documents
        Model-based Document Assignment
    Text Summarization
        Making Ontology-enriched Graphical Representations for Each Sentence
        Constructing Text Semantic Interaction Network (TSIN)
        Selecting Significant Text Contents for Summary
    A Semantic Version of Swanson's ABC model
        The Algorithm Bio-SbKDS
        MeSH Term Qualification
        Bi-Decision Maker
        Combinational Search Method
    Document Clustering using Bipartite Graph Representation (COBRA)
        Bipartite Graphical Representation for Documents through Concept Mapping
        Initial Clustering by Combining Co-Occurrence Concepts
        Mutual Refinement Strategy for Document Clustering .......... 71

CHAPTER 4: EXPERIMENTAL EVALUATION
    Document Clustering using Scale-free Graphical Representation
        Document Sets
        Evaluation Method
        Experimental Setting
        Experiment Results
    Text Summarization
    Swanson's ABC model .......... 90

        Raynaud Disease - Fish Oils
        Migraine - Magnesium
    Document Clustering using Bipartite Graph Representation (COBRA)
        Document Sets
        Evaluation Method
        Experimental Setting
        Evaluation Results
            Comparison of Nine Document Clustering Approaches
            Evaluation of Mutual Refinement Strategy and Use of Co-occurrence Concepts
    Comparison of Traditional Document Clustering Approaches
        Experiments
        Experiment Results .......... 124

CHAPTER 5: CONCLUSION AND FUTURE STUDIES
    Graphical Representation Method
    A Coherent Document Clustering and Text Summarization
    A Comprehensive Comparison Study of Document Clustering
    A Semantic Version of Swanson's ABC Model
    Document Clustering using Bipartite Graph Representation (COBRA)
    Explicit Answers to the Four Research Questions
        What advantages does the graphical representation method for documents have over the traditional vector space representation method?
        How much can ontologies improve text mining results?

        How stable is the output of the semantic text mining approach across datasets? Does the text mining performance heavily depend on datasets?
        How scalable is the semantic text mining system, compared with a traditional text mining system?
    Future Studies
        Possible Enhancements
        More Semantic Text Mining Components .......... 146

LIST OF REFERENCES
VITA .......... 156

LIST OF TABLES

Table 1: Difference between TM, DM, IR, IE, and Database Query .......... 3
Table 2: Selected Biomedical Ontologies .......... 11
Table 3: HVS and non-HVS for Sample Graph Clusters .......... 41
Table 4: Semantic Relations for some semantic types .......... 52
Table 5: Relation Filter between C concept and B concepts .......... 52
Table 6: Semantic Relations for some semantic types .......... 52
Table 7: Extended semantic types through tracking ISA relations .......... 54
Table 8: The Semantic Types that have no relationship
Table 9: The Relation Filter between A concepts and B concepts .......... 55
Table 10: The Semantic Types as Category Restrictions for B Concepts and A Concepts .......... 56
Table 11: TOP 5 bridge concepts with their counts .......... 58
Table 12: The Definitions of MeSH terms .......... 61
Table 13: The Combination Search Keywords and their Weights .......... 62
Table 14: The Document Sets and Their Sizes .......... 75
Table 15: List of Test Corpora Generated from the Base Data Sets .......... 76
Table 16: Sample Classes and Clustering Output. Each number in the table is the number of objects in its class or cluster .......... 78
Table 17: Summary of Overall Experiment Results on MEDLINE Document Sets .......... 82
Table 18: Experiment Results for Text Summarization: For the Alzheimer Disease document cluster its document cluster model and key sentences as summary are shown .......... 87

Table 19: Experiment Results for Text Summarization: For the Parkinson Disease document cluster its document cluster model and key sentences as summary are shown .......... 88
Table 20: Experiment Results for Text Summarization: For the Osteoarthritis document cluster its document cluster model and key sentences as summary are shown .......... 89
Table 21: Experiment Results of the two problems (# of B=3 vs. # of B=5)
Table 22: Search Keywords against Medline
Table 23: The Degrees of the Relationships between Raynaud Disease and A Concepts derived .......... 93
Table 24: LSI (Raynaud Disease - Fish Oil) .......... 94
Table 25: Association rule (Raynaud Disease - Fish Oil) .......... 96
Table 26: Experimental Results of the two problems (# of B=3 vs. # of B=5) .......... 98
Table 27: Search Keywords against Medline
Table 28: The Degrees of the Relationships between Migraine and A Concepts derived
Table 29: LSI (Migraine - Magnesium)
Table 30: Association rule (Migraine - Magnesium)
Table 31: List of Test Corpora Generated from the Base Data Sets .......... 104
Table 32: Comparison of MIs for COBRA and the eight clustering approaches .......... 108
Table 33: Comparison of Entropy for COBRA and the eight clustering approaches .......... 109
Table 34: Comparison of F-measure for COBRA and the eight clustering approaches .......... 110
Table 35: Comparison of Purity for COBRA and the eight clustering approaches .......... 111
Table 36: Relative Superiority of COBRA to the Traditional Approaches in Terms of Cluster Quality and Clustering Reliability .......... 113

Table 37: The Most Contributing Significant Semantic Features to Hay Fever Document Cluster .......... 114
Table 38: The Most Contributing Significant Semantic Features to Osteoarthritis Document Cluster
Table 39: The Most Contributing Significant Semantic Features to Kidney Calculi (stones) Document Cluster .......... 116
Table 40: Overall Improvements of COBRA through Mutual Refinement Strategy (MRS) and the Use of Co-occurrence Concepts .......... 118
Table 41: Overview of Test Corpora .......... 122
Table 42: Document Clustering Methods and Clustering Options to be evaluated
Table 43: Comparison of Clustering Evaluation Metrics and Running Times for the Two Cluster Selection Methods of BiSecting K-means .......... 126
Table 44: Comparison of Clustering Evaluation Metrics and Running Times for Bisecting K-means and K-means .......... 127
Table 45: Comparison of Clustering Evaluation Metrics and Running Times for STC and Hierarchical algorithms on the smallest six datasets (due to the scalability problem)
Table 46: Comparison of Evaluation Metrics and Running Times for STC and Partitional algorithms on the smallest twenty four datasets (due to the scalability problem of STC) .......... 129
Table 47: Comparison of Clustering Evaluation Metrics and Running Times for Hierarchical and Partitional algorithms .......... 131
Table 48: Cluster Quality Improvement Using Ontology for Hierarchical (a), STC (b), Bisecting K-means (c), and K-means (d) and Overall Clustering Improvement (e) .......... 134
Table 49: The Correlations between the Four Cluster Evaluation Metrics (MI, F-measure, Purity, and Entropy) .......... 136

LIST OF FIGURES

Figure 1: The Exploding Number of MEDLINE Articles over Years .......... 1
Figure 2: The MeSH Descriptor "Neoplasms" with its definition and Entry Terms
Figure 3: A part of the MeSH Tree .......... 13
Figure 4: An illustrative example of the UMLS .......... 15
Figure 5: Swanson's ABC model for UDPK .......... 22
Figure 6: The Concept Mapping from MeSH Entry Terms to MeSH Descriptors .......... 28
Figure 7: Individual graphical representations for each document
Figure 8: Integration of individual graphs .......... 32
Figure 9: A graphical representation of a document set as a scale-free network. This graph is from a test corpus that consists of 21,977 documents and has 9 classes .......... 33
Figure 10: The Flow of Scale-Free Graph Clustering (SFGC) Algorithm
Figure 11: Pseudocode of Scale-Free Graph Clustering (SFGC) Algorithm .......... 36
Figure 12: Two sample graphical document cluster models from the corpus-level graphical representation in Figure 9 .......... 40
Figure 13: Edit Distance between Two Graphical Representations of D1 and D2 .......... 44
Figure 14: The Data Flow of Bio-SbKDS .......... 50
Figure 15: The Counts of MeSH Terms Assigned to MEDLINE Articles .......... 59
Figure 16: A Sample Bipartite Graph between Documents and Corpus-level Co-occurrence Concepts .......... 67
Figure 17: The Initial Clustering Algorithm .......... 70

Figure 18: The Mutual Refinement Strategy Algorithm .......... 73
Figure 19: Comparison of MIs for COGR and Traditional Document Clustering Approaches .......... 84
Figure 20: The Scalabilities of Bisecting K-means, K-means, Hierarchical algorithms, and STC on Different Sizes of Sample Datasets

Abstract

Semantic Text Mining and its Application in Biomedical Domain
Illhoi Yoo
Xiaohua Hu, Ph.D.

A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE, because the most natural form in which to store information is text. In order to cope with this pressing text information overload, text mining is employed. However, traditional text mining approaches have several problems, such as the use of the vector representation for documents. In this thesis, we introduce a semantic text mining approach that can overcome these traditional problems. The approach consists of several important text mining components: a graphical representation method for documents that relies on domain ontologies, document clustering that takes advantage of scale-free network theory to mine the corpus-level graphical representation, text summarization, and a semantic version of Swanson's ABC model. The primary contributions of this dissertation are four-fold. First, we introduce a graphical representation method for documents that takes advantage of a domain ontology. Second, the semantic document clustering approach is unique in that it provides users with document cluster models derived from an ontology-enriched scale-free representation of a set of documents; these models serve as summaries for each document cluster and also explain the document categorization. Third, in order to maximize the usefulness of document clustering, we introduce a text summarization approach that makes use of the document cluster models. Finally, we introduce a semantic way to generate reasonable hypotheses based on evidence from the biomedical literature, using the complementary structures in disjoint literatures.


CHAPTER 1: INTRODUCTION

A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE [NLM, 2006], PubMed Central [1] [PubMed Central], BioMed Central [2] [BioMed Central], etc., for decades, because the most natural form in which to store information is text. For example, MEDLINE, the largest biomedical bibliographic text database, has more than 16 million articles, and more than 10,000 articles are added to MEDLINE weekly. Figure 1 shows the exploding volume of biomedical literature in MEDLINE over the past 57 years [3], which makes it difficult to locate and manage the public biomedical information.

[Figure 1: The Exploding Number of MEDLINE Articles over Years (MEDLINE size in number of articles, by year, through April 2006)]

[1] PubMed Central offers free access to the full text of a few hundred thousand journal articles, while PubMed provides abstracts for millions of articles from thousands of journals.
[2] BioMed Central, a commercial publisher of online biomedical journals, offers free access to full-text articles, while PubMed Central and PubMed are digital archives at the U.S. National Institutes of Health (NIH), developed and managed by the National Library of Medicine (NLM).
[3] The data was retrieved from PubMed using the "dp" keyword, which stands for Date of Publication (e.g., 2000 [dp] gives the number of articles registered to MEDLINE during 2000).
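Footnote 3 describes retrieving the per-year article counts behind Figure 1 from PubMed with the "dp" (Date of Publication) field. As a minimal sketch, the same counts can be requested through NCBI's E-utilities ESearch endpoint; the endpoint and parameters below reflect general E-utilities usage, not anything specified in the thesis.

```python
# Build ESearch query URLs that return only the article count for one
# publication year, mirroring the "2000 [dp]" style query in footnote 3.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_query_url(year: int) -> str:
    """URL requesting the number of PubMed articles published in `year`."""
    return f"{BASE}?db=pubmed&term={year}[dp]&rettype=count"

for year in (1950, 1980, 2006):
    print(count_query_url(year))
```

Fetching these URLs requires network access and is subject to NCBI usage guidelines; the snippet only constructs the queries.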

While the biomedical literature provides us with full descriptions of novel discoveries and information, it does not supply them in a structured format that depicts a predefined interpretation of them. In addition, because this unstructured data (i.e., text) normally lacks metadata (structured information about data), there is no standard means to facilitate and improve the retrieval of information and, further, the text analysis [Karanikas and Theodoulidis, 2002]. Thus, it is very hard to keep up to date with novel biomedical discoveries and information even within one's narrow field of research, to take advantage of them for their management, and, further, to discover knowledge from them. In order to facilitate the use of hidden biomedical information/knowledge in ever-growing text, we should overcome the ever-demanding challenge of extracting knowledge from text [Hu et al, 2006a] [Hu et al, 2006b] [Hu et al, 2005a] [Hu et al, 2005b].

In order to cope with this pressing text information overload, text mining [4] is employed. Text mining has been defined as the non-trivial discovery process for identifying novel patterns in unstructured text [Fan et al, 2005] [Mooney and Nahm, 2003] [Karanikas and Theodoulidis, 2002]. Text mining (TM) as a new research area has been advanced by using techniques from information retrieval (IR), information extraction (IE) including natural language processing (NLP), data mining (DM) including machine learning (ML), etc. [Spasic et al, 2005] [Karanikas and Theodoulidis, 2002] [Sullivan, 2001] [Rajman and Besan, 1997]. However, TM is different from (1) DM in that TM handles unstructured text while DM processes structured data in databases; (2) IR in that TM looks for novel patterns in text (in other words, TM is an exploratory analysis) while IR fetches users' already-existing relevant documents; and (3) IE in that while IE extracts facts or events of interest to users (e.g., protein names) using named entity recognition (NER) technologies and identifies the relationships among them (e.g., protein-protein interaction patterns in text), TM, using the interesting domain-specific facts and the relationships among them, discovers novel knowledge (e.g., the identification of protein functionality). These differences are summarized in Table 1.

[4] Text mining [Fan et al, 2005] [Spasic et al, 2005] [Mooney and Nahm, 2003] [Karanikas and Theodoulidis, 2002] [Sullivan, 2001] [Larsen and Aone, 1999] [Tan, 1999] [Rajman and Besan, 1997] is also known as text data mining [Hearst, 1997] or knowledge discovery from textual databases [Feldman & Dagan, 1995].

Table 1: Difference between TM, DM, IR, IE, and Database Query

                                    Looking for Novel Knowledge    Looking for Already-known Facts
  Structured Data                   Data Mining (DM)               Database Query
  Unstructured Data (i.e., text)    Text Mining (TM)               Information Retrieval (IR) and
                                                                   Information Extraction (IE)

The ultimate goal of text mining would be to lift the burden of information overload from researchers. In order to reach this goal, TM has been used for document categorization [Yang and Pedersen, 1997], document clustering [Yoo et al, 2006] [Larsen and Aone, 1999], text summarization [Mukherjea and Bamba, 2004], trend analysis [Lent et al, 1997] [Nomiyama, 1997], question answering [Radev et al, 2002], hypothesis generation [Swanson, 1986], etc. For instance, document clustering enables us to group

19 4 similar text information and text summarization provides condensed text information by extracting the most important text content from a similar document set or a document cluster. In addition, using complementary structures in disjoint literatures (e.g., Swanson s ABC model [Swanson, 1986] [Swanson, 1991]) makes it possible to generate reasonable hypotheses based on evidences from biomedical literature. Traditional text mining approaches, however, have three major problems. First, traditional approaches are based on the vector space model. The use of vector space representation for documents causes two major limitations. The first limitation is the vector space model assumes all the dimensions in the space to be independent. In other words, the model assumes that words/terms are mutually independent in documents. However, most words/terms in a document are related to each other. This is a fundamental problem of the vector space model on document representation. For example, consider the word set, {Vehicle, Car, Motor, Automobile, Auto, Ford}; they are not independent but are closely related. The second limitation is that text processing in a high dimensional space significantly hampers its similarity detection for objects (here, documents) because distances between every pair of objects tend to be the same regardless of data distributions and distance functions [Beyer et al., 1999]. Thus, it may dramatically decrease clustering performance. Second, most traditional text mining approaches do not consider semantically related words/terms (e.g., synonyms or hyper/hyponyms). For instance, they treat {Cancer, Tumor, Neoplasm, Malignancy} as different terms even though all these words have very similar meaning. This problem may lead to a very low relevance score for relevant documents because the documents do not always contain the same forms of

words/terms. In fact, the problem comes intrinsically from the fact that traditional document clustering approaches neither perceive objects nor understand what the objects mean. Lastly, with vector representations of documents based on the bag-of-words model, traditional text mining approaches tend to use all the words/terms in the documents after removing the stop-words. This leads to thousands of dimensions in the vector representation of documents; this is called the "Curse of Dimensionality". However, it is well known that only a very small number of words/terms in documents have distinguishing power for clustering documents [Wang et al., 2002] and become the key elements of text summaries. Those words/terms are normally the concepts in the domain related to the documents. Until now, most biomedical text mining approaches have unfortunately resorted to the use of neither ontologies nor even a simple thesaurus. They rely only on machine learning methods, which usually assume all the objects (words/terms in text mining) are independent of one another; this is a deep-rooted problem of text mining, as mentioned earlier. Moreover, the most challenging issues in applying text mining to the biomedical domain are the high terminological variation and the complex semantic relationships among biomedical terms, emphasizing the need for biomedical ontologies in text mining. These problems explain why traditional text mining approaches applied to the biomedical domain generally yield inferior results compared to other domains (e.g., newswire) [Spasic et al, 2005]. All these problems of traditional text mining approaches have motivated this research. In this thesis, we introduce a semantic text mining approach. The semantic text

mining (STM) is different from traditional text mining in that STM makes use of the domain knowledge in ontologies related to the target text [Yoo et al, 2006]. The primary motivations behind the use of the domain knowledge in ontologies for text mining are the following. First, the use of ontologies is the only way to handle the complex semantic relationships among words/terms in text, because ontologies supply synonym sets for every concept (e.g., Entry terms in MeSH [NLM-MeSH, 2006]) and hierarchically arrange concepts from most general to most specific in a concept hierarchy [5]. To this end, the simple use of ontologies in text mining allows us to easily solve the traditional synonym/hypernym/hyponym problems [6]. In addition, through tracking the concept hierarchy, ontologies enable text mining approaches to recognize the relationship between two terms, both for measuring the semantic similarities between documents and for spanning disparate biomedical information in different documents for automatic hypothesis generation. Second, ontologies make it possible to link new discoveries in the biomedical literature to existing biomedical knowledge for knowledge induction (i.e., extracting unknown patterns or rules from particular facts or instances in documents) as well as for knowledge management, including ontology learning.

[5] The terms in ontologies normally appear in more than one place in the hierarchy, so the terms are actually represented in a graph.
[6] Information retrieval also has these problems.

1.1 Research Questions

The purpose of this research is to design and develop some important components of a semantic text mining framework and to evaluate them by applying them to the biomedical domain (i.e., MEDLINE). Those components are a graphical representation method for

documents that relies on domain ontologies, document clustering that takes advantage of scale-free network theory to mine the corpus-level graphical representation, text summarization, and a semantic version of Swanson's ABC model [Swanson, 1986]. This research mainly investigates how much a semantic text mining approach improves overall performance compared with traditional text mining approaches. In other words, because the core of semantic text mining is ontologies, we measure how the use of ontologies affects the text mining process. Especially, we investigate what advantages the graphical representation method for documents has over the traditional vector space representation [Salton et al, 1975]. In addition, it is very important to investigate whether the semantic text mining approach yields stable [7] results, because the answer keys to real text mining problems are unknown and the results from existing text mining solutions are inconsistent in their performance, according to the experiment results in Chapter 4. Moreover, because the volume of biomedical literature is increasing at an unprecedented rate, the scalability is measured and compared with that of traditional text mining approaches. The following are the four research questions that are addressed in this thesis.

What advantages does the graphical representation method for documents have over the traditional vector space representation method?

How can ontologies improve text mining results?

How stable is the output of the semantic text mining approach across datasets? Does the text mining performance heavily depend on datasets?

[7] Here, "stable" means the corresponding approach provides high-quality results regardless of data sets. This factor can be measured in standard deviation.

How scalable is the semantic text mining system, compared with traditional text mining systems?

1.2 Contributions of the Thesis

Although there is a growing interest in the use of ontologies in text mining, i.e., semantic text mining, there are very few studies taking advantage of ontologies in text mining; most text mining approaches are based only on traditional information extraction and machine learning technologies. This research introduces semantic text mining applied to the biomedical domain. The primary contributions of this dissertation are four-fold. First, we introduce a graphical representation method for documents that takes advantage of a domain ontology and successfully apply the graphical representation method to a document clustering approach. Second, the semantic document clustering approach is unique in that it provides users with document cluster models derived from an ontology-enriched scale-free representation of a set of documents; these models serve as summaries for each document cluster and explain the document categorization. Thus, the document cluster models greatly improve the understandability of each document cluster. In addition, this is the first text mining approach to which scale-free network theory is applied. Moreover, we demonstrate the superiority of this approach over a leading document clustering approach, Bisecting K-means, in terms of clustering quality and clustering reliability. Third, in order to maximize the usefulness of document clustering (i.e., the understandability of document clusters), we introduce a text summarization approach that makes use of the document cluster models. Finally, we introduce a semantic way to generate reasonable

hypotheses based on evidence from the biomedical literature using the complementary structures in disjoint literatures.

1.3 Organization of Thesis

The rest of the thesis is organized as follows: Chapter 2 surveys the work related to this dissertation. In Chapter 3, our novel research methods in document clustering, text summarization, and Swanson's ABC model are discussed. Chapter 4 explains the experimental evaluation, including the test datasets, evaluation methods, experimental settings, and evaluation results. Chapter 5 concludes the thesis with a summary of the major research findings, the main contributions, and directions for future work.

CHAPTER 2: RELATED WORK

In this chapter, we discuss the core component of semantic text mining (i.e., ontologies) and the work related to some important components of a semantic text mining framework, which are a document representation method, document clustering, text summarization, and a semantic version of Swanson's ABC model.

2.1 Ontologies

Because the core of semantic text mining is the use of ontologies, and the semantic text mining will be applied to the biomedical domain, we briefly discuss ontologies and some important biomedical ontologies. An ontology is "a formal, explicit specification of a shared conceptualization" for a domain of interest [Gruber, 1995]. To this end, an ontology is organized by concepts, identifying all the possible relationships among the concepts. Thus, for well-structured ontologies such as Medical Subject Headings (MeSH) or the Unified Medical Language System (UMLS), the corresponding domain communities can reach a consensus on the knowledge in the ontologies. For this reason, ontologies can be used as domain knowledge for knowledge-based systems or intelligent agents. There are many biomedical ontologies; refer to Open Biomedical Ontologies (OBO) for a comprehensive list of biomedical ontologies. Each ontology except UMLS has its intended purpose and biomedical aspect; UMLS is NLM's effort to integrate all the major biomedical ontologies or vocabularies. Table 2 shows the most widely-used

biomedical ontologies. We use the MeSH ontology for the graphical representation of documents. For the Swanson's ABC model problem, both MeSH and UMLS are used. Each of them is briefly discussed in the following sections.

Table 2: Selected Biomedical Ontologies

  Name                                        Homepage
  Unified Medical Language System (UMLS)
  Medical Subject Headings (MeSH)
  Gene Ontology (GO)

Medical Subject Headings (MeSH)

Medical Subject Headings (MeSH) mainly consists of a controlled vocabulary and a MeSH Tree. The controlled vocabulary contains several different types of terms, such as Descriptors, Qualifiers, Publication Types, Geographics, and Entry terms. Among them, Descriptors and Entry terms are used in this research because only they can be extracted from documents. Descriptor terms are main concepts or main headings. Entry terms are the synonyms or the related terms of descriptors. For example, as shown in Figure 2, "Neoplasms" as a descriptor has the following entry terms: {"Cancer", "Cancers", "Neoplasm", "Tumors", "Tumor", "Benign Neoplasm", "Neoplasm,

Benign"}. MeSH descriptors are organized in a MeSH Tree, which can be seen as the MeSH Concept Hierarchy, as shown in Figure 3. In the MeSH Tree there are 15 categories (e.g., category A for anatomic terms), and each category is further divided into subcategories. For each subcategory, the corresponding descriptors are hierarchically arranged from most general to most specific. In addition to its ontology role, MeSH Descriptors have been used to index MEDLINE articles. For this purpose, about 10 to 20 MeSH terms are manually assigned to each article (after reading the full papers) by highly trained curators. In the assignment of MeSH terms to articles, about 3 to 5 MeSH terms are set as MajorTopics that primarily represent an article.

Figure 2: The MeSH Descriptor "Neoplasms" with its definition and Entry Terms.
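The Descriptor/Entry-term relationship just described is the basis of concept mapping: every Entry term found in text is replaced by its Descriptor, so synonymous surface forms collapse into one concept. A toy sketch, using only a handful of the terms shown in Figure 2 (a real mapping would be loaded from the NLM MeSH distribution files, not hard-coded):

```python
# Toy MeSH concept mapping: Entry term (lowercased) -> Descriptor.
# This sample covers only a few "Neoplasms" entry terms for illustration.
ENTRY_TO_DESCRIPTOR = {
    "cancer": "Neoplasms",
    "cancers": "Neoplasms",
    "tumor": "Neoplasms",
    "tumors": "Neoplasms",
    "neoplasm": "Neoplasms",
    "malignancy": "Neoplasms",
}

def map_to_descriptors(text: str) -> list[str]:
    """Replace known entry terms with their MeSH Descriptor; keep other tokens."""
    mapped = []
    for token in text.lower().split():
        word = token.strip(".,;()")
        mapped.append(ENTRY_TO_DESCRIPTOR.get(word, word))
    return mapped

print(map_to_descriptors("The tumor was a malignancy"))
# ['the', 'Neoplasms', 'was', 'a', 'Neoplasms']
```

After this mapping, documents mentioning "cancer", "tumor", or "malignancy" all share the single feature "Neoplasms", which is what makes ontology-based similarity measures possible.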

Figure 3: A part of the MeSH Tree

Unified Medical Language System (UMLS)

The Unified Medical Language System (UMLS), started by the National Library of Medicine as a long-term R&D project in 1986, provides a mechanism for integrating all the major biomedical vocabularies, including MeSH. UMLS consists of three knowledge sources: the Metathesaurus, the Semantic Network, and the SPECIALIST lexicon. The Metathesaurus, as the core, is organized by concepts (meanings); synonymous terms are clustered together to form a concept, and concepts are linked to other concepts by means of various types of

relationships, to provide the various synonyms of concepts and to identify useful relationships between different concepts [NLM-UMLS, 2006]. Currently, the Metathesaurus (2006AA version) contains more than 1 million biomedical concepts (meanings) and 5 million unique concept names from more than 100 different source vocabularies. All concepts are assigned to at least one semantic type as a category. For example, the term "Raynaud disease" has the semantic type [Disease or Syndrome], and "Fish oils" has the semantic type [Biologically Active Substance]. Currently, there are 135 semantic types and 54 relations. Each semantic type has at least one relationship with other semantic types. Both the semantic types and the semantic relationships are hierarchically organized. Semantic relationships can be hierarchical (e.g., "isa", "part of") or associative (e.g., "treat-with-drug", "interact-with"). Since most MeSH terms from MEDLINE documents are included in the UMLS Metathesaurus concepts, we know the semantic types of MeSH terms. Thus, given two MeSH terms, we can derive the relationship between them from their semantic relation. Figure 4 shows the relationships of the concepts, semantic types, and semantic relations of "Raynaud disease", "blood viscosity", and "Fish oils". Using this semantic network of UMLS concepts, knowledge-based systems or intelligent software/agents are able to perceive that Fish oils affect blood viscosity and that blood viscosity is one of the symptoms of Raynaud disease. Further, they can hypothesize that Fish oils would be a medicine for Raynaud disease.
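The chain of reasoning just described (Fish oils affect blood viscosity; blood viscosity is a symptom of Raynaud disease; therefore Fish oils may treat Raynaud disease) can be sketched as a search for bridge concepts over a set of relation triples. The triples and relation names below are illustrative stand-ins modeled on Figure 4, not actual UMLS Semantic Network content:

```python
# Toy relation triples (subject, relation, object) in the spirit of the
# UMLS semantic network; the second A-B-C chain is invented for illustration.
RELATIONS = [
    ("Fish Oils", "affects", "Blood Viscosity"),
    ("Blood Viscosity", "symptom_of", "Raynaud Disease"),
    ("Magnesium", "affects", "Vascular Reactivity"),
    ("Vascular Reactivity", "symptom_of", "Migraine"),
]

def hypothesize(a_concept: str, c_concept: str) -> list[str]:
    """Bridge concepts B such that A -affects-> B and B -symptom_of-> C."""
    affected = {o for s, r, o in RELATIONS if s == a_concept and r == "affects"}
    bridges = {s for s, r, o in RELATIONS
               if o == c_concept and r == "symptom_of" and s in affected}
    return sorted(bridges)

print(hypothesize("Fish Oils", "Raynaud Disease"))  # ['Blood Viscosity']
```

A nonempty result is a candidate hypothesis of the Swanson A-B-C form: the A concept may act on the C disease through the bridge concept B.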

Figure 4: An illustrative example of the UMLS

Gene Ontology (GO)

The goal of the Gene Ontology (GO) is to provide a controlled vocabulary for the genes and proteins of all organisms, as well as knowledge of gene and protein roles [GO Consortium, 2006]. The controlled vocabulary (simply, GO terms) is taxonomically grouped into three structured networks (molecular function, biological process and cellular component) that describe gene product attributes. The networks are structured as directed acyclic graphs because a gene product that has many molecular functions may be used in many biological processes and be related to many cellular components. Although GO terms form a graph, all GO terms are rooted (hierarchically arranged) in the GO_Ontology concept. However, a GO term may have many parents and/or many children at different levels.
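Because a GO term may have many parents, collecting its more general terms means following every parent path in a DAG, not walking a single tree branch. A minimal sketch, with made-up term names (real GO identifiers look like GO:0008150):

```python
# Minimal sketch of ancestor collection in a GO-style DAG.
# The term names and parent links are invented for illustration.
PARENTS = {
    "term_d": {"term_b", "term_c"},   # multiple parents: a DAG, not a tree
    "term_b": {"term_a"},
    "term_c": {"term_a"},
    "term_a": set(),                  # root
}

def ancestors(term):
    """All ancestors of a term, following every parent path."""
    seen = set()
    stack = [term]
    while stack:
        for p in PARENTS.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

print(sorted(ancestors("term_d")))  # ['term_a', 'term_b', 'term_c']
```

Note that term_a is reached through two different paths but is collected only once; this deduplication is what distinguishes traversal of a DAG from traversal of a tree.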

Vector Space Representation of Documents

Because current text mining (or information retrieval) technologies are not able to read and understand text like human beings, owing to its unstructured nature, text mining imposes some structure on text, reducing its complexity so that existing data mining or machine learning algorithms can process it easily. In IR, which has a longer history than text mining, this problem has been solved for decades by the vector space representation of documents [Salton et al, 1975], in which various information retrieval objects, including user queries as well as documents, are modeled in a vector space. In the vector space model, a text is represented as a vector by means of representative keywords called index terms. These index terms are derived from the text through document indexing, so that their semantics indicate the main themes of the corresponding documents [Baeza-Yates and Ribeiro-Neto, 1999]. Because every text contains a limited number of words, most document vectors are very sparse. In addition to index term selection, term weights are regarded as important because the weights reflect the importance of each term in the content of the documents. In the vector space model, the most widely used weighting scheme is TF*IDF, the combination of the term frequency (TF), first used by Luhn in the 1950s [Luhn, 1958], and the inverse document frequency (IDF). The TF*IDF of a term is expressed as the product of the probability that the term occurs (the TF, when normalized by the sum of term frequencies in the document) and the amount of information the term carries (the IDF, in the sense of information theory) [Aizawa, 2003]. TF*IDF is mathematically rendered as

TF*IDF = (freq / DocSize) × log2(CorpusSize / DF),

where freq is the number of times the term occurs in a document, DocSize is the number of words in the document, DF is the number of documents containing the term in the corpus, and CorpusSize is the number of documents in the corpus.

2.3 Document Clustering

The problem of document clustering is defined as follows. Given a set of n documents called DS, DS is clustered into a user-defined number k of document clusters DS1, DS2, ..., DSk (i.e., DS1 ∪ DS2 ∪ ... ∪ DSk = DS) so that the documents in a cluster are similar to one another while documents from different clusters are dissimilar. In order to measure similarities between documents, documents are represented using the vector space model. In this model, each document d is represented as a high-dimensional vector of word/term frequencies (in the simplest form), where the dimensionality is the size of the vocabulary of DS. Similarity between two documents has traditionally been measured by the cosine of the angle between their vector representations, though there are a number of other similarity measures. Documents are then grouped using a cluster criterion function, applied in an iterative optimization process, that measures key aspects of inter-cluster and intra-cluster similarity. A number of document clustering approaches have been developed over several decades. Most of these approaches are based on the vector space representation and apply various clustering algorithms to that representation. Thus, the

approaches can be categorized as hierarchical or partitional [Kaufman and Rousseeuw, 1999]. Hierarchical agglomerative clustering algorithms have been used for document clustering. These algorithms successively merge the most similar objects, based on the pairwise distances between objects, until a termination condition holds. Thus, the algorithms can be classified by the way they select the pair of objects for calculating the similarity measure (e.g., single-link, complete-link, and average-link). An advantage of these algorithms is that they generate a document hierarchy, so that users can drill up and drill down to specific topics of interest. However, due to their cubic time complexity, they do not scale to very large numbers of documents. Partitional clustering algorithms (especially K-means) are the most widely used algorithms in document clustering [van Rijsbergen, 1979]. Most of these algorithms first randomly select k centroids and then decompose the objects into k disjoint groups by iteratively relocating objects based on the similarity between the centroids and the objects. As one of the most widely used partitional algorithms, K-means minimizes the sum of squared distances between the objects and their corresponding cluster centroids. As a variation of K-means, Bisecting K-means [Steinbach et al, 2000] first selects a cluster (normally the biggest one) to split and then splits its objects into two groups (i.e., k = 2) using K-means. One major drawback of partitional algorithms is that the clustering results are highly sensitive to the initial centroids, because the centroids are randomly selected. There are some hybrid document clustering approaches that combine hierarchical and partitional clustering algorithms. For instance, Buckshot [Cutting et al, 1992] is

basically K-means, but uses average-link clustering to set the initial cluster centroids, on the assumption that hierarchical clustering algorithms provide clustering quality superior to K-means. However, Larsen and Aone [Larsen and Aone, 1999] pointed out that using a hierarchical algorithm for the centroids does not significantly improve the overall clustering quality compared with random selection of centroids. More recently, Hotho et al. introduced a semantic document clustering approach that uses background knowledge [Hotho et al, 2002]. The authors apply an ontology during the construction of the vector space representation by mapping terms in documents to ontology concepts and then aggregating concepts based on the concept hierarchy, a process called concept selection and aggregation (COSA). As a result of COSA, they resolve the synonym problem and introduce more general concepts into the vector space to identify related topics easily [Hotho et al, 2002]. Their method, however, cannot reduce the dimensionality (i.e., the number of document features) of the vector space; it still suffers from the curse of dimensionality. While all the approaches mentioned above represent documents as feature vectors, Suffix Tree Clustering (STC) [Zamir and Etzioni, 1998] does not rely on the vector space model. STC treats a document not as a set of words, where order is unimportant, but rather as an ordered sequence of words (i.e., a set of phrases). In fact, phrases instead of words have long been used in IR systems [Buckley et al, 1995]. One of the major drawbacks of STC is that semantically similar nodes may be distant within a suffix tree, because STC does not consider the semantic relationships among phrases (nodes, or base clusters). In addition, some common expressions may cause unrelated documents to be combined. Recently, Eissen et al. applied STC to the RCV1 document collection of

Reuters Corporation and showed that STC did not produce good clustering results; the average F-measure was 0.44 [zu Eissen et al, 2005].

2.4 Text Summarization

Text summarization has been studied since Luhn's work in 1958 [Luhn, 1958]. Since then, a variety of summarization approaches have been introduced. For instance, there are statistical methods based on the bag-of-words model, linguistic methods using natural language processing, knowledge-based methods using concepts and their relations, and summary generation methods. The first three approaches try to seek out the most important information (usually sentences or terms) for a condensed version of the documents, while the last approach generates a completely new summary consisting of informative terms, phrases, clauses and sentences. The main difficulty of the last approach is figuring out how to combine these elements into sentences that are grammatically correct. In the bioinformatics/biomedical field, many multi-document summarization systems have also been introduced. TextQuest [Iliopoulos et al., 2001] is designed to summarize documents retrieved in response to a keyword-based search on PubMed. However, it does not retain the association between the genes and the retrieved documents. MedMiner [Tanabe et al., 1999] can provide summarized literature information on genes, but it is limited to finding relations between only two genes. In addition, it returns a few hundred sentences as the summary. Shatkey et al. [Shatkey et al., 2000] suggested a system that attempts to find functional relations among genes on a genome-wide scale. However, this system requires the user to specify, for each gene, a representative document that describes the gene very well. Looking for the

representative document may take a lot of time, effort and knowledge on the part of the user. In addition, as genes have multiple biological functions, it is very rare to find a document that covers all aspects of a gene across various biological domains. GEISHA [Blaschke et al., 2001] is based on comparing the frequencies of abstracts linked to different gene clusters. Interpretation by the end user of the biological meaning of the terms is facilitated by embedding them in the corresponding significant sentences and abstracts and by establishing relations with other, equally significant terms.

2.5 Swanson's ABC Model

The huge volume of the biomedical literature provides both an opportunity and a challenge: to induce novel knowledge by finding novel connections among logically related medical concepts. For example, Swanson's ABC model formalizes the procedure to discover UnDiscovered Public Knowledge (UDPK) [Swanson, 1986] from the biomedical literature as follows (see Figure 5). Consider two separate sets of literature, BC and AB, where documents in BC discuss concept C and documents in AB discuss concept A. Each of these two sets of literature discusses its relationships with some intermediate concepts B (also called bridge concepts). Their possible connection via the concepts B, however, is not discussed in either of these two groups of literature. The goal is to find novel connections between the target concept A and the starting concept C, as shown in Figure 5. (The original definition of UDPK by Swanson is that UDPK is knowledge which can be public, yet undiscovered, if independently created fragments are logically related but never retrieved, brought together, and interpreted [Swanson, 1986].)
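The core of the ABC model, finding A concepts reachable from C through bridge concepts B while excluding A concepts already co-cited with C, can be sketched as set operations over co-occurrence data. The tiny "literature" below is invented for illustration; a real run would use titles or MeSH terms from Medline documents.

```python
# Sketch of Swanson-style ABC discovery over co-occurrence data.
# Each document is modeled as the set of concepts it mentions.
DOCS = [
    {"migraine", "spreading depression"},    # C-B literature
    {"spreading depression", "magnesium"},   # B-A literature
    {"migraine", "stress"},
    {"stress", "caffeine"},
]

def abc_candidates(c):
    """A candidates linked to C through bridge concepts B, excluding
    any concept already co-cited with C."""
    # B: concepts co-occurring with C
    b_terms = set().union(*(d for d in DOCS if c in d)) - {c}
    # A: concepts co-occurring with some B, in documents not mentioning C
    a_terms = (set().union(*(d for d in DOCS if d & b_terms and c not in d))
               - b_terms - {c})
    # drop anything already co-cited with C
    cocited = set().union(*(d for d in DOCS if c in d)) - {c}
    return a_terms - cocited

print(sorted(abc_candidates("migraine")))  # ['caffeine', 'magnesium']
```

On this toy corpus, magnesium is proposed as a hypothesis for migraine via the bridge concept spreading depression, matching Swanson's famous migraine-magnesium case; note that the sketch also surfaces an irrelevant candidate (caffeine), illustrating why the literature discussed below invests so much effort in filtering.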

Figure 5: Swanson's ABC model for UDPK

Swanson's ABC model can be described as the process of inducing "A implies C" from both "A implies B" and "B implies C"; the derived knowledge or relationship "A implies C" is not conclusive but hypothetical. The B concepts are the bridge between the C and A concepts. The following steps summarize the procedure [Swanson, 1987]:
1. Specify the user's goal (a starting concept C, such as a disease or symptom).
2. Search the biomedical literature for the documents BC relevant to C.
3. Generate a set of selected words (the B list) from BC using a predefined stop-list filter; B concepts are chosen from only the titles of the documents.
4. Search the literature for the documents AB related to the B concepts.
5. Generate a set of words (the A candidates) from AB; A concepts are also taken from only the titles of the documents.
6. Check whether A and C are co-cited together in the literature; if not, keep A.

7. Rank the A terms based on how many linkages they make with B terms.
One of the drawbacks of Swanson's method is that a large amount of manual intervention is required. Even though he and his colleague designed an interactive tool called Arrowsmith to automate some of the steps [Swanson and Smalheiser, 1999], the procedure still requires much manual intervention, such as choosing proper stop-word lists and filtering through a large number of connections to identify the genuinely novel connections/hypotheses. Another problem is that the number of relationships or associations among the huge number of biomedical concepts grows exponentially. The key to the UDPK problem is how to exclude meaningless concept pairs (A-B and B-C), because an automated process yields too many irrelevant suggestions, and how to span disparate islands of information (i.e., "independently created fragments" in Swanson's terms). Several algorithms have been developed to overcome the limitations of Swanson's approach. Hristovski et al. [Hristovski et al, 2001] used MeSH descriptors rather than the title words of the documents. They use association rule algorithms to find co-occurrences of the words. Their method finds all B concepts that act as bridges related to the starting concept C. Then all A concepts related to the B concepts are found through Medline searching. But because in Medline each concept can be associated with many other concepts, the possible number of B-C and A-B combinations can be extremely large. In order to deal with this combinatorial problem, the algorithm incorporates filtering and ordering capabilities [Hristovski et al, 2003] [Hristovski et al, 2001] [Joshi et al, 2004] [Pratt and Yetisgen-Yildiz, 2003] [Srinivasan, 2004]. Pratt and Yetisgen-Yildiz [Pratt and Yetisgen-Yildiz, 2003] used Unified Medical Language System (UMLS) concepts instead of the MeSH terms assigned to Medline documents. Similar to Swanson's method,

their search space is limited to only the titles of documents for the starting concept. They reduce the number of terms (B concepts and A concepts) by limiting the search space. In addition, they reduce the number of terms/concepts by pruning out terms that are too general (e.g., terms such as problem, test, etc.), too closely related to the starting concept, or meaningless. They defined a term as too general if the term is found in the titles of more than 10,000 documents. For terms too closely related to the starting concept, they tracked all the parent and child concepts of the starting concept and then eliminated the related terms. To avoid meaningless terms, they followed the same method as in [Hristovski et al, 2001], manually selecting a subset of semantic types to which the collected terms should belong. Before generating association rules, they grouped the concepts (B or A concepts) to obtain a much coarser level of synonyms. Then, they removed too-general concepts by examining their UMLS hierarchy level and discarded non-UMLS concepts. With the qualified and grouped UMLS concepts, they used the well-known Apriori algorithm [Agrawal et al, 1995] to find correlations among the concepts. Although they managed to replicate Swanson's migraine-magnesium case only through concept grouping, their method still requires strong domain knowledge, especially for selecting semantic types for the A and B concepts, as well as some vague parameters for defining too-general concepts. Srinivasan [Srinivasan, 2004] viewed Swanson's method as having two dimensions. The first dimension is identifying relevant concepts for a given concept. The second dimension is exploring the specific relationships between concepts. However, [Srinivasan, 2004] deals only with the first dimension. The key point of this approach is that MeSH terms are grouped into the semantic types of UMLS to which they belong.
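The three pruning criteria described above can be sketched as a simple filter. The term statistics, semantic types and related-concept set below are invented for illustration; the 10,000-title threshold is the one quoted in the text.

```python
# Sketch of Pratt and Yetisgen-Yildiz-style term pruning.
TOO_GENERAL_THRESHOLD = 10_000  # titles containing the term

def prune_terms(title_df, related_to_start, allowed_semantic_types, term_types):
    """Keep terms that are not too general, not a parent/child of the
    starting concept, and whose semantic type is in the chosen subset."""
    kept = []
    for term, df in title_df.items():
        if df > TOO_GENERAL_THRESHOLD:
            continue                                  # too general
        if term in related_to_start:
            continue                                  # related to starting concept
        if term_types.get(term) not in allowed_semantic_types:
            continue                                  # "meaningless" for this search
        kept.append(term)
    return kept

title_df = {"magnesium": 4200, "problem": 55000, "migraine": 3000}
related = {"migraine"}
types = {"magnesium": "Element, Ion, or Isotope", "problem": "Idea or Concept"}
print(prune_terms(title_df, related, {"Element, Ion, or Isotope"}, types))
# ['magnesium']
```

Here "problem" is rejected as too general, "migraine" as too close to the starting concept, and only "magnesium" survives; the manual choice of allowed semantic types is exactly the domain-knowledge burden the text criticizes.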

However, only a small number (8 out of 134) of semantic types are considered, since the author believes those semantic types are the ones relevant to B and A concepts. For each semantic type, the MeSH terms that belong to it are ranked based on a modified TF*IDF. There are some limitations to this method. First, the author used manually-generated semantic types for filtering. Second, the author applied the same semantic types to both A and B terms. Because A and B terms play different roles with respect to the C term, different semantic types should be applied to each. These research works have made significant progress on Swanson's method. However, none of the approaches mentioned above consider the specific semantic relationships, and the association problem should be tackled by not only the information measure but also the semantic information among the concepts.

CHAPTER 3: SEMANTIC TEXT MINING AND ITS APPLICATION IN BIOMEDICAL DOMAIN

In this chapter, we discuss four important components of a semantic text mining framework. Those components are (1) a graphical representation method for documents that relies on domain ontologies, (2) document clustering that takes advantage of scale-free network theory to mine the corpus-level graphical representation generated by the graphical representation method, (3) text summarization, and (4) a semantic version of Swanson's ABC model [Swanson, 1986]. Additionally, we discuss another document clustering approach, using a semantic-based bipartite graph representation and a mutual refinement strategy; this approach is an alternative to the scale-free-based document clustering mentioned earlier.

3.1 Graphical Representations of Documents

All text mining approaches must first convert documents into a proper format. Since we view documents as sets of concepts with complex internal semantic relationships, each document is represented as a graph structure using the MeSH ontology. The primary motivations behind the graphical representation of documents are the following. First, the graphical representation of documents is a very natural way to portray the contents of documents, because the semantic relationship information about the concepts in the documents is retained in the representation, while the vector space representation loses all of this information. Second, the graphical representation method provides document representation independence. This means that the graphical

representation of a document does not affect the representations of other documents. In the vector space representation, the addition of a single document usually requires changes to every document representation. Third, the graphical representation guarantees better scalability than the vector space model. Because a document representation is the actual data structure used in text processing, its size should be as small as possible for better scalability. As the number of documents to be processed increases, a corpus-level graphical representation expands at most linearly, or keeps its size with only some changes to edge weights, while a vector space representation (i.e., a document*word matrix) grows at least linearly, increasing by n*t, where n is the number of documents and t is the number of distinct terms in the documents. We represent the graph as a triple G = (V, E, w), where V is a set of vertices that represent MeSH Descriptors, E is a set of edges that indicate the relationships between vertices, and w is a set of edge weights that are assigned according to the strength of the edge relationships. The relationships are derived from both the concept hierarchy in the MeSH ontology and the concept dependencies over documents. All the details are discussed below. The whole procedure takes two steps: concept mapping in documents, and construction of individual graphical representations from both the mapped concepts and their higher-level concepts. First, the concept mapping matches the terms in each document to the Entry terms in MeSH and then maps the selected Entry terms into MeSH Descriptors. The process is as follows. Instead of searching all Entry terms in MeSH against each document, 1- to 3-gram word sequences are selected as candidate MeSH Entry terms after removing all stop words from each document. The process selects those

candidate terms that match MeSH Entry terms and then replaces semantically similar Entry terms with their Descriptor term to remove synonyms. Next, the process filters out MeSH Descriptors that are too general (e.g., HUMAN, WOMEN or MEN) or too common in MEDLINE articles (e.g., ENGLISH ABSTRACT or DOUBLE-BLIND METHOD); see Section for details. We assume that such terms have no distinguishing power for clustering documents. Hence, we have selected a set of only meaningful corpus-level concepts, in terms of MeSH Descriptors, representing the documents. The first step is illustrated in Figure 6. This figure shows that MeSH Entry term sets are detected in documents Doc 1 and Doc 2, and that the Entry terms are then replaced with Descriptors using the MeSH ontology.

Figure 6: The Concept Mapping from MeSH Entry Terms to MeSH Descriptors
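The first step can be sketched as follows. The tiny Entry-term table, stop-word list and filter set below are assumptions for illustration; the real procedure uses the full MeSH Entry-term and Descriptor tables.

```python
# Sketch of the Entry-term detection and Descriptor mapping step.
ENTRY_TO_DESCRIPTOR = {
    "migraine": "MIGRAINE",
    "migraine headache": "MIGRAINE",   # synonymous Entry terms share a Descriptor
    "fish oils": "FISH OILS",
}
STOP_WORDS = {"the", "of", "in", "a"}
TOO_GENERAL = {"HUMAN", "WOMEN", "MEN"}

def map_concepts(text):
    """Map 1- to 3-gram candidates onto MeSH Descriptors, dropping
    stop words first and too-general Descriptors afterwards."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    descriptors = set()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            d = ENTRY_TO_DESCRIPTOR.get(gram)
            if d and d not in TOO_GENERAL:
                descriptors.add(d)
    return descriptors

print(sorted(map_concepts("the role of fish oils in migraine headache")))
# ['FISH OILS', 'MIGRAINE']
```

Note how both "migraine" and "migraine headache" collapse onto the single Descriptor MIGRAINE, which is how this step removes synonyms.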

In the second step, the detected MeSH concepts are extended by incorporating their higher-level (i.e., more general) concepts from the MeSH Tree into the graphical representation. The main purpose of the concept extension is to make the graphical representation richer in meaning. The primary benefit of the concept extension is to help users recognize similar topics. For example, a migraine document may acquire the concepts {VASCULAR HEADACHES, CEREBROVASCULAR DISORDERS, BRAIN DISEASES, CENTRAL NERVOUS SYSTEM DISEASES} through extension of the MIGRAINE concept in the document, and these extended concepts may link the document to any vascular-headache documents. For each step of the concept extension, an edge consisting of a concept and its higher-level concept is drawn in the graph. For such new edges, weights are assigned based on their extension lengths. This reflects the fact that, as the layers of the concept hierarchy go up, concepts become more general and less similar to concepts at lower levels. In this way, as concept extensions are made from a base concept, the weights of the new edges created by the extensions decrease. The mechanism can be explained with the taxonomic similarity [Rada et al, 1989], or the set similarity |α ∩ β| / |α ∪ β|, where α is the set of all parent concepts of a concept plus the concept itself, and β is the corresponding set for its immediate higher-level concept. Figure 7 illustrates this second step. Based on the MeSH Tree, the Descriptor terms of each document (e.g., {B,C,H} for document D1) are extended with their higher-level concepts (e.g., {A,E,J} in Figure 7); my approach incorporates higher-level concepts up to, but not including, the 15 category sub-roots of the MeSH Tree. The mechanism of edge weights is

simple. The weight of edge B-A, for example, is |{B,A,E} ∩ {A,E}| / |{B,A,E} ∪ {A,E}| = 2/3. For identical edges (e.g., A-E and Q-S), the weights add up. For example, the weight of edge A-E is 2 × |{A,E} ∩ {E}| / |{A,E} ∪ {E}| = 2 × (1/2) = 1. Note that the thickness of the edges in the graphical representations indicates the edge weights; the thicker the edge, the heavier the weight.

Figure 7: Individual graphical representations for each document

3.2 Document Clustering using Scale-free Graphical Representation

The document clustering in a semantic text mining framework consists of three steps: (1) integration of the individual graphs into a corpus-level graphical representation, (2) graph clustering of the corpus-level graphical representation, and (3) model-based document assignment. The first step is to integrate the individual graph representations produced by the graphical representation method of the previous section into the corpus-

level graphical representation of a set of documents. The purpose of the integration is to identify the semantic chunks capturing the semantic relationships among the terms in each document cluster. In order to identify the semantic chunks in the corpus-level graphical representation of a set of documents, graph clustering is employed, which is the second step. This graph clustering algorithm takes advantage of the (power-law) term distribution in documents. The output of the second step is k semantic chunks, called document cluster models. Using these models, each document is assigned to the appropriate document cluster model, which generates the document clusters.

Integration of Individual Graphs into Corpus-level Graphical Representation

In the first step, the individual graphs generated from each document are merged into a corpus-level graph. In this step the graph is further enriched by reflecting concept dependence, that is, the co-occurrence of concepts in documents. This is based on the fact that co-occurring concepts imply semantic associations that the ontology cannot contain. The remaining problem for co-occurrence concepts is how to set the co-occurrence threshold; term pairs whose co-occurrence counts are equal to or greater than this value are considered co-occurring terms. Because the threshold value depends heavily on the documents, or on the queries used to retrieve them, we developed a simple algorithm to detect a reasonable threshold value instead of setting a fixed value. This algorithm finds a bisecting point in one-dimensional data. It first sorts the data, takes the two end objects (i.e., the minimum and the maximum) as centroids, and then assigns the remaining objects to the two centroids based on the distances between each remaining object and a centroid. After each assignment of the

objects, the centroids are updated. After the threshold value is obtained, co-occurring concepts are mirrored as edges in the graph, and their co-occurrence counts are used as edge weights. During graph integration, the weights of identical edges add up. Figure 8 shows this step. The corpus-level graph is made by merging the individual graphs and by reflecting co-occurring concepts as new edges. Note that the integrated graph in Figure 8 is based on only the four documents (D1 to D4) and two co-occurring concept pairs from the whole document set (D1 to Dn). Figure 9 shows a real graph, a typical scale-free network, which is discussed in Section

Figure 8: Integration of individual graphs

Additionally, Figure 8 illustrates one of the advantages of this approach. Although documents D1 and D3, or documents D2 and D4, do not share any common concepts (thus, traditional approaches would not recognize any similarity between them), when

the documents are represented as graphs, their graphs can have some common vertices (e.g., {A,E,J} for documents D1 and D3, and {L,S,Q} for documents D2 and D4). Thus, documents D1 and D3, and documents D2 and D4, are regarded as similar to each other. This is because my document representation method incorporates higher-level concepts, relating semantically similar documents that do not share common terms.

Figure 9: A graphical representation of a document set as a scale-free network. This graph is from a test corpus that consists of 21,977 documents and has 9 classes.

Graph Clustering for a Graphical Representation of Documents

A number of phenomena and systems, such as protein-protein interactions [Bader and Hogue, 2003], the Internet [Barabasi and Albert, 1999], and social networks

[Wasserman and Faust, 1994], have been modeled as networks or graphs. Traditionally, those networks were interpreted with Erdos and Rényi's random graph theory, where nodes are distributed randomly and two nodes are connected randomly and uniformly (i.e., following a Gaussian distribution) [Erdos and Rényi, 1960]. However, researchers have observed that a variety of networks, such as those mentioned above, deviate from random graph theory [Amaral et al, 2000] [Strogatz, 2001] in that a few highly connected nodes link to a high fraction of all nodes (there are a few hub nodes). These hub nodes cannot be explained by traditional random graph theory. Barabasi and Albert therefore introduced the scale-free network [Barabasi and Albert, 1999]. The scale-free network can explain hub nodes with high degrees because its degree distribution decays as a power law, P(k) ~ k^(-γ), where P(k) is the probability that a vertex interacts with k other vertices and γ is the degree exponent [Barabasi and Albert, 1999]. Ferrer-Cancho and Solé have observed that the graph connecting words in English text follows a scale-free network [Ferrer-Cancho and Solé, 2001]. Thus, the graphical representation of documents belongs to the highly heterogeneous family of scale-free networks. The Scale-Free Graph Clustering (SFGC) algorithm is based on this scale-free nature (i.e., the existence of a few hub vertices (concepts) in the graphical representation). SFGC starts by detecting k hub vertex sets (HVSs) as the centroids of k graph clusters and then assigns the remaining vertices to graph clusters based on the relationships between the remaining vertices and the k hub vertex sets. Figure 10 illustrates the flow of the SFGC algorithm and Figure 11 shows its pseudo-code. Before we describe SFGC in detail, we define the following terms.

Hub vertices: a set of the most heavily-connected vertices in each graph cluster, in terms of both the degrees of the vertices and the weights of the edges connected to them, since the graph is weighted.

A graph cluster: a set of vertices that have stronger relationships with the hub vertices of the corresponding cluster than with those of other clusters.

A centroid: a set of hub vertices, not a single vertex, because we assume that a single term taken as the representative of a document cluster may have its own dispositions, so that the term may not have strong relationships with the other key terms of the corresponding cluster. This complies with scale-free network theory, where centroids are sets of vertices that have high degrees.

Figure 10: The Flow of the Scale-Free Graph Clustering (SFGC) Algorithm

Algorithm: SFGC (Scale-Free Graph Clustering)
Input: a graph; k (the desired number of graph clusters); p (initial # of vertices for hub vertex detection)
Output: k graph clusters, k hub vertex sets

// 1: Calculate salience scores of the vertices V
For each edge e_j in E
    For each v_i in e_j
        Salience(v_i) += weight(e_j)
    End For
End For
// Sort V in descending order of Salience(v)
Sort(V, Salience(v), desc)

// 2: Detect k hub vertex sets (HVS)
LSI = 0     // loop start index
LFI = p     // # of vertices used for the HVS detection
Do  // iteration
    For each HVS_i   // there are k HVSs
        Do  // nested iteration
            For j = LSI To LFI   // v_LSI to v_LFI (subset of V)
                If HasAnyRelationship(HVS_i, v_j)
                    HVS_i = HVS_i ∪ {v_j}
                End If
            End For
        Loop While AnyMemberChanged(HVS_i)
    End For
    // Hub vertex set qualification
    MergingSimilarSets(HVS, E)   // E: an edge set
    LSI = LFI + 1
    LFI += p
Loop While AnyVacancy(HVS)

// 3: Assign the remaining vertices to graph clusters (GC)
GC = HVS   // copy the vertices in each HVS to GC
Do  // iterations
    For each v in {V} - {HVSs}
        AssignVertexToBestFitCluster(v, HVS, GC)
    End For
    // update HVS from GC
    UpdatingHVS(HVS, GC)
Loop While AnyMemberChanged(GC)
Return (GC)

Figure 11: Pseudo-code of the Scale-Free Graph Clustering (SFGC) Algorithm
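The salience-scoring and ranking step at the top of Figure 11 can be rendered in a few lines of Python; the weighted edge list below is a made-up toy graph for illustration.

```python
from collections import defaultdict

# Sketch of SFGC's step 1: each vertex accumulates the weights of its
# incident edges, and vertices are then ranked in descending salience order.
def salience_ranking(edges):
    """edges: iterable of (u, v, weight). Returns vertices sorted by salience."""
    salience = defaultdict(float)
    for u, v, w in edges:
        salience[u] += w
        salience[v] += w
    return sorted(salience, key=salience.get, reverse=True)

edges = [("A", "B", 2.0), ("A", "C", 1.0), ("B", "C", 0.5), ("C", "D", 0.25)]
print(salience_ranking(edges))  # ['A', 'B', 'C', 'D']
```

With these weights, A accumulates salience 3.0, B 2.5, C 1.75 and D 0.25, so A and B would be the first candidates examined for hub vertex sets.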

Detecting k hub vertex sets as cluster centroids

The main process of SFGC is to detect k hub vertex sets (HVSs) as the centroids of the k graph clusters. An HVS is a set of vertices with high degrees in a scale-free network. Because HVSs are the cluster centroids, we might consider betweenness-based methods such as Betweenness Centrality [Newman, 2004b] to measure the centrality of vertices in a graph; see [Newman, 2004a] for the latest comprehensive review. However, those methods have cubic running times [Newman, 2004b], so they are not appropriate for very large graphs. A recent scale-free network study [Wu et al, 2004] reports that Betweenness Centrality (BC) yields better experimental results for finding cluster centroids than random sampling, degree ranking, and the well-known HITS, but that degree ranking is comparable with BC. Comparing the complexities of BC (O(|V|³)) and degree ranking (O(|V|)) on very large graphs, degree ranking should be selected. Unlike [Wu et al, 2004], which considers only the degrees (i.e., counts of edges connected to vertices), we consider edge weights, since the graph is weighted. To this end, we introduce the salience scores of vertices, obtained as the sum of the weights of the edges connected to each vertex. The salience of a vertex is mathematically rendered as follows:

Salience(v_i) = Σ_j weight(e_j), over the edges e_j incident to v_i.

In order to consider highly salient vertices for HVSs first, the vertices are sorted in descending order of their salience scores. Within the top n vertices, SFGC iteratively searches for a vertex that has a strong relationship with any vertex in each HVS, because we assume all the vertices in an HVS are strongly related to each other. If a vertex has relationships with more than one HVS, the HVS that has the strongest relationship

with the vertex is selected. After a vertex is assigned, it is not used for HVS detection anymore. Sometimes HVSs are semantically similar enough to be merged, because a document set (or a document cluster) may have multiple but semantically related topics. In order to measure the similarity between two HVSs, we calculate the intra-edge weight sum of each of the two HVSs (as intra-cluster similarity) and the inter-edge weight sum between them. This mechanism is based on the fact that a good graph cluster should have both maximum intra-cluster similarity and minimum inter-cluster similarity. Thus, if the inter-edge weight sum is equal to or greater than either of the intra-edge weight sums, the corresponding two HVSs are merged. When this happens, SFGC tries to seek a new HVS.

Assigning Remaining Vertices to k Graph Clusters

Each of the remaining vertices (i.e., non-HVS vertices) is (re)assigned to the graph cluster to which it is most similar. The similarity is based on the relationships between the vertex and each of the k HVSs; the strength of these relationships is measured as the sum of the edge weights. In this way the k graph clusters are populated with the remaining vertices. In order to refine the graph clusters, SFGC iteratively reassigns vertices to the clusters while updating the k HVSs from their graph clusters, just as K-means updates its k cluster centroids at each iteration to improve cluster quality. During the updates of the HVSs, it uses the bisecting technique (also used for the co-occurrence threshold) to select a new HVS from the vertices in each graph cluster based on their salient scores. In other words, the technique separates the vertices in each graph cluster into two groups (HVS and non-HVS). Using the new HVSs, the vertices are reallocated to the most similar

cluster. These iterations continue until no changes are made to the clusters, or stop at a certain number of iterations. Finally, SFGC outputs both the graph clusters and the HVSs as models. Figure 12 shows two sample HVSs generated from the graph in Figure 9. The significance of the graphical document cluster models is that (1) each model captures the core semantic relationship information about a document cluster and provides its intrinsic meaning in a simple form; and (2) this facilitates the interpretation of each cluster in terms of its key descriptors and could support effective information retrieval.

Model-based Document Assignment

So far, we have discussed how documents are represented as a graph using the MeSH ontology and how the graph consisting of concepts is clustered into k graph clusters on the basis of HVSs. In this section, we explain how to assign each document to a document cluster. In order to decide which document belongs to which document cluster, COGR matches the graphical representation of each document with each of the graph clusters used as models. Here, we might adopt graph similarity mechanisms, such as edit distance (the minimum number of primitive operations for structural modifications on a graph).

Figure 12: Two sample graphical document cluster models from the corpus-level graphical representation in Figure 9.

However, these mechanisms are not appropriate for this task because individual document graphs and graph clusters are too different in terms of their numbers of vertices and edges. As an alternative to graph similarity mechanisms we adopt a vote mechanism. This mechanism is based on the classification (HVS or non-HVS) of the vertices in the graph clusters according to their salient scores; the classification leads to different numbers of votes. Each vertex of each individual document graph casts one of two different numbers of votes for a document cluster, depending on whether the vertex belongs to that cluster's HVS or its non-HVS. Each document is then assigned to the document cluster that receives the majority of the votes. For example, suppose there are three graph clusters or models, as shown in Table 3. To assign a document whose graph representation consists of the vertices {A, C, O, W}, A casts x votes for Graph Cluster (GC) 1 because A belongs to the HVS of GC 1, and W casts y votes for GC 3 because W is found in the non-HVS of GC 3 (similarly, C casts x votes for GC 1 and O casts y votes for GC 2). Suppose x=5 and y=1 (the numbers of votes used in our experiments); then GC 1 has 10 votes, GC 2 has 1 vote, and GC 3 has 1 vote. GC 1 thus has the majority of the votes and the document is assigned to GC 1 (i.e., document cluster 1).

Table 3: HVS and Non-HVS for Sample Graph Clusters

                    HVS        Non-HVS
Graph Cluster 1     {A,B,C}    {D,E,F,G,H,I,J,K}
Graph Cluster 2     {L,M,N}    {O,P,Q,R,S,T}
Graph Cluster 3     {U,V}      {W,X,Y,Z}
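The voting scheme can be sketched as follows. The cluster data mirrors Table 3, and the vote counts x=5 and y=1 are the values reported for the experiments:

```python
def assign_document(doc_concepts, clusters, hvs_votes=5, non_hvs_votes=1):
    """Assign a document to the graph cluster that receives the most votes.

    `clusters` maps a cluster id to a (HVS, non_HVS) pair of vertex sets.
    A concept found in a cluster's HVS casts `hvs_votes` votes for that
    cluster; a concept in its non-HVS casts `non_hvs_votes`.
    """
    votes = {cid: 0 for cid in clusters}
    for concept in doc_concepts:
        for cid, (hvs, non_hvs) in clusters.items():
            if concept in hvs:
                votes[cid] += hvs_votes
            elif concept in non_hvs:
                votes[cid] += non_hvs_votes
    return max(votes, key=votes.get), votes

clusters = {
    1: ({"A", "B", "C"}, set("DEFGHIJK")),
    2: ({"L", "M", "N"}, set("OPQRST")),
    3: ({"U", "V"}, set("WXYZ")),
}
best, votes = assign_document({"A", "C", "O", "W"}, clusters)
print(best, votes)  # cluster 1 wins with 10 votes (A and C are in its HVS)
```

Running this on the Table 3 example reproduces the vote totals given in the text: 10, 1, and 1.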

Text Summarization

Text summarization condenses the information in a set of documents into a concise summary. The text summarization problem has traditionally been addressed by selecting and ordering sentences in documents based on a salience scoring mechanism [Harabagiu and Lacatusu, 2005]. We address the problem by analyzing the semantic interaction of sentences (as summary elements). This semantic structure of sentences is called the Text Semantic Interaction Network (TSIN), where the vertices are sentences. We select sentences (vertices in the network) as summary elements based on degree centrality. Unlike traditional approaches, we do not use linguistic features (e.g., that the first sentence of a text is more important than the others) for summarizing MEDLINE abstracts, since they usually consist of only a single paragraph. Text summarization takes three steps: (1) making an ontology-enriched graphical representation of each sentence using the semantic relationships in the document cluster models; (2) constructing the Text Semantic Interaction Network (TSIN) using sentences as nodes and the semantic relationships in the document cluster model; (3) selecting significant text content for the summary by considering its centrality in the network.

Making Ontology-enriched Graphical Representations for Each Sentence

The first step, the graphical representation of sentences, is basically the same as the graphical representation method for documents shown in Section 3.1 except for concept extension and individual graph integration. In this step the concepts in sentences are extended using the relationships in the relevant document cluster models rather than the

entire concept hierarchy. In other words, we extend concepts within the relevant semantic field.

Constructing the Text Semantic Interaction Network (TSIN)

The key process of text summarization is selecting salient sentences (or, in some approaches, paragraphs) as summary elements. We assume that the sentences in the summary have strong semantic relationships with other sentences, because summary sentences cover the main points of a set of documents and comprise a condensed version of the set. In order to represent the semantic relationships among sentences, we construct the Text Semantic Interaction Network (TSIN), where the vertices are sentences, the edges are the semantic relationships between them, and the edge weights indicate the degrees of the relationships. In order to handle the semantic relationships between sentences and calculate the similarities between them (as edge weights in the network), we use the edit distance between the graphical representations of the sentences. The edit distance between G1 and G2 is defined as the minimum number of structural modifications required to make G1 equal to G2, where a structural modification is a vertex insertion, vertex deletion, or vertex update. For example, the edit distance between the two graphical representations of D1 and D2 in Figure 13 is 5.
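Under this vertex-level definition (insert, delete, update), the edit distance between two sentence graphs can be approximated by comparing their labeled vertex sets. This is a simplified sketch of my own that ignores edge structure; the concept labels are invented for illustration:

```python
def vertex_edit_distance(g1_vertices, g2_vertices):
    """Minimum number of vertex insertions, deletions, and updates
    needed to turn one labeled vertex set into the other.

    An update relabels one vertex, so it covers one deletion/insertion
    pair; any remainder must be plain deletions or insertions.
    """
    only_in_g1 = len(set(g1_vertices) - set(g2_vertices))
    only_in_g2 = len(set(g2_vertices) - set(g1_vertices))
    return max(only_in_g1, only_in_g2)

# hypothetical concept vertices for two sentence graphs
d1 = {"Migraine", "Serotonin", "Vasoconstriction", "Platelet Aggregation"}
d2 = {"Migraine", "Serotonin", "Blood Viscosity"}
print(vertex_edit_distance(d1, d2))  # 2: one update plus one deletion
```

Distances computed this way can be inverted into similarities to serve as TSIN edge weights.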

Figure 13: Edit Distance between the Two Graphical Representations of D1 and D2

Selecting Significant Text Contents for Summary

A number of approaches have been introduced over the decades to identify important nodes (vertices) in networks (or graphs). These approaches are normally categorized into degree centrality based approaches and betweenness centrality based approaches. The degree centrality based approaches assume that nodes that have more relationships with others are more likely to be regarded as important in the network, because they can directly relate to many other nodes. In other words, the more relationships the nodes in the network have, the more important they are. The betweenness centrality based approaches view a node as being in a favoured position to the extent that the node falls on the geodesic paths between other pairs of nodes in the

network [Hanneman and Riddle, 2005]. In other words, the more nodes rely on a given node to make connections with other nodes, the more important that node is. These two approaches have their own advantages and disadvantages. For example, betweenness centrality based approaches yield better experimental results for finding cluster centroids than other relevant approaches, as mentioned in Section 3.2.2, but they require cubic running times and thus are not appropriate for very large graphs. Degree centrality based approaches have been criticized because they take into account only the immediate relationships of each node, but they require only linear running time and provide output quality comparable with betweenness centrality based approaches. We therefore adopt degree centrality to measure the centrality of sentences in the TSIN because of its linear computational time. In order to overcome the disadvantage mentioned above, we measure, for each node, the semantic relationships with all other nodes (i.e., pairwise similarities for every pair of nodes), so that both the immediate and the distant relationships of each node are considered while using degree centrality. The proposed text summarization approach takes advantage of the document cluster models produced by the document clustering method. The coherence of document clustering and text summarization is required because a set of documents usually covers multiple topics. For this reason text summarization does not yield high-quality summaries without document clustering. On the other hand, document clustering is not very useful for helping users understand a set of documents if no explanation of the document categorization, or no summary of each document cluster, is provided. In other words, document clustering and text summarization are complementary. This is the primary motivation for our coherent approach to document clustering and text summarization.
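Weighted degree centrality over all pairwise similarities can be sketched as follows. The similarity values are assumed to be precomputed (e.g., derived from the edit distances between sentence graphs):

```python
def select_summary_sentences(similarity, k):
    """Rank sentences by weighted degree centrality in the TSIN.

    `similarity` is assumed to map a frozenset pair of sentence ids to
    an edge weight; because every pair is present, both immediate and
    distant relationships contribute to each node's score.
    """
    centrality = {}
    sentences = {s for pair in similarity for s in pair}
    for s in sentences:
        centrality[s] = sum(w for pair, w in similarity.items() if s in pair)
    # the k most central sentences form the summary
    return sorted(sentences, key=centrality.get, reverse=True)[:k]

sims = {
    frozenset({0, 1}): 0.9,
    frozenset({0, 2}): 0.4,
    frozenset({1, 2}): 0.1,
}
print(select_summary_sentences(sims, 1))  # sentence 0 is the most central
```

Because every pairwise similarity feeds into each node's score, this retains degree centrality's linear-time ranking while softening its purely local view.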

A Semantic Version of Swanson's ABC Model

Two key problems in mining the biomedical literature for UDPK are: (1) determining which already-established connections to the starting concept (such as Raynaud disease) should serve as a bridge; and (2) deciding to which other novel-but-related concepts this bridge might link to form a novel hypothesis. We propose a semantic-based mining approach that explains how relationships or associations among concepts can be semantically induced. Developing a semantic-based mining approach from text with minimal intervention from domain experts and minimal training examples has always been a challenge for researchers. In the biomedical domain, numerous advances in biomedical ontologies such as UMLS and MeSH have now made this challenging task possible. We present a system, Bio-SbKDS (Biomedical Semantic-based Knowledge Discovery System), that automatically mines undiscovered public knowledge from the biomedical literature using a combination of ontology knowledge and data mining. We rely on biomedical ontologies such as UMLS and MeSH for identifying biomedical concepts, their semantic types, and the semantic relationships among them. Compared to previous research, Bio-SbKDS uses the semantic network in UMLS to identify meaningful correlations among concepts, and uses those correlations for open-ended discovery. In contrast, association rule-based approaches generate all the possible connections among medical concepts, but only a tiny portion of those connections would make the linking term medically plausible. Bio-SbKDS, however, generates far fewer connections in order to capture discoveries that are likely to be novel, and uses semantic knowledge to significantly reduce the search spaces in the

discovery procedure. In order to create an automated approach to identifying interesting and meaningful terms for the B and A concepts, we rely on the semantic types (e.g., a medical condition or disease and a potential treatment) that are plausible for terms that could be correlated. Bio-SbKDS then filters out any concepts that do not match the semantic-type criteria. In contrast, Swanson addressed this problem by manually creating a customized list of stop words to filter out uninteresting concepts, but such word-level customization could be difficult to scale to new medical concepts and connections. Since most MeSH terms in Medline documents are included in the UMLS Metathesaurus concepts, we know the semantic types of MeSH terms. Thus, given two MeSH terms, we can derive the relationship between them from their semantic relation. Figure 4 shows the relationships of the concepts, semantic types, and semantic relations of Raynaud disease, blood viscosity, and Fish oils.

The Algorithm Bio-SbKDS

We have developed a biomedical literature mining system called the Biomedical Semantic-based Knowledge Discovery System (Bio-SbKDS). The inputs are a Medline search keyword as a MajorTopic MeSH term plus a date range, the possible semantic relationships between C (the starting concept) and the to-be-discovered target concepts, and the role of the keyword in the initial semantic relations. For example, if the starting concept is Raynaud disease, the relations selected are "treats" and "prevents" because we try to find something (the target concepts A) that treats or prevents Raynaud disease.

Our algorithm takes full advantage of the semantic knowledge in UMLS to select appropriate semantic types for the B and A concepts through mutual qualification and to identify relevant B and A concepts. The advantage of the algorithm is that, using only the initial relations (the possible relationships between the C concept and the A concepts), all the semantic types for both the B concepts and the A concepts are automatically derived using the biomedical ontology (UMLS). Because there must be at least one relationship between the semantic types for B and the semantic types for A, the derived semantic types for the A and B concepts are mutually qualified by considering their relationships (explained in STEP 5 below).

Algorithm Bio-SbKDS
INPUT: Starting concept C as a MeSH term plus a date range; the initial semantic relations ISR between the starting concept and the to-be-discovered target concept; the role of the keyword in the possible relations (subject or object)
OUTPUT: Target concept list (A concepts)
Procedure:
STEP 1  Find the semantic types ST_C of the starting concept C from the ontology UMLS.
STEP 2  Find all the possible semantic types of the to-be-discovered concepts B related to ST_C; the semantic types derived are called ST_B_can ("can" means candidates) and are used as the category restriction for B concepts.
STEP 3  Extract all semantic types related to ISR, which are the candidate

semantic types for the to-be-discovered target concepts A; the result is denoted ST_A_can.
STEP 4  Extend ST_A_can obtained in STEP 3 by following the ISA relations; the extended semantic types are called ST_A_can_ext.
STEP 5  Check whether there are relations between ST_B_can and ST_A_can_ext and whether the two semantic type sets pass the relation filter. If not, such semantic types are dropped from their semantic type lists. After removing irrelevant semantic types, ST_B_can becomes ST_B and ST_A_can_ext becomes ST_A.
STEP 6  Search the biomedical literature to get all the documents CL related to C; CL is the source of B concepts. Then extract MeSH terms from CL; the terms are called B_can.
STEP 7  Apply the B concept category restriction (ST_B) to B_can, selecting only the terms that belong to at least one semantic type of ST_B. In addition, the Bi-Decision Maker [8] further qualifies B_can. Here, the top-ranked B terms, called B_top, are selected.
STEP 8  Search all B_top terms to get all the documents AL; AL is the source of the to-be-discovered A concepts. Then extract MeSH terms from AL; the terms are called A_can.
STEP 9  Apply the A concept category restriction (ST_A) to A_can. In addition, the Bi-Decision Maker further qualifies A_can.

STEP 10  From A_can, retain those concepts that have not co-occurred with the C concept in Medline. The top-ranked A concepts are selected.

Figure 14 shows the data flow of the procedure for mining the undiscovered public knowledge. Each circled number in Figure 14 indicates the corresponding step in the algorithm. Below we explain each step in detail using Raynaud disease as our example.

Figure 14: The Data Flow of Bio-SbKDS

STEP 1: The semantic type of the starting concept C (ST_C) is identified through the UMLS semantic network. At this time, only a MeSH term is allowed as a starting concept, because the semantic type of the starting concept is used to construct the semantic type list for the B terms. For example, for Raynaud disease, the semantic type is [Disease or Syndrome].

STEP 2: All the semantic types (ST_B_can) that have at least one of the relations in the relation filter with ST_C (the semantic type of the keyword) are selected, considering the role of the initial keyword (i.e., as subject or as object). For example, in Table 4 [Physiologic Function] and [Steroid] are selected because the role of the initial keyword is set as an object on the interactive system and the relation filter includes "process_of", "result_of", and "causes"; each record in Table 4 can be read as a sentence (e.g., "Steroid causes Disease or Syndrome"). The relation filter between C and B is shown in Table 5. The semantic types collected (ST_B_can) are used as the category restriction on the semantic types of the B terms. This is based on the fact that B terms have at least one relationship with the C term.

STEP 3: In order to derive the semantic types of the A terms, the initial semantic relations (e.g., "treats", "prevents") are used. Here, it is important whether the C term is set as a subject or an object of the initial relations. If the term is set as an object, only the semantic types in the first (not the third) column of Table 6 are considered in the search space.

Table 4: Semantic Relations for Some Semantic Types

Semantic Types (as subjects)          Relation      Semantic Types (as objects)
Physiologic Function                  process_of    Disease or Syndrome
Physiologic Function                  result_of     Disease or Syndrome
Steroid                               causes        Disease or Syndrome

Table 5: Relation Filter between C Concept and B Concepts

process_of, result_of, manifestation_of, causes

Table 6: Semantic Relations for Some Semantic Types

Semantic Types (as subjects)          Relation      Semantic Types (as objects)
Antibiotic                            treats        Disease or Syndrome
Drug Delivery Device                  treats        Disease or Syndrome
Medical Device (too general)          treats        Disease or Syndrome
Pharmacologic Substance               treats        Disease or Syndrome
Therapeutic or Preventive Procedure   treats        Disease or Syndrome

However, if a semantic type is too general, it is ignored. Whether or not a semantic type is too general is decided by its hierarchy level. Currently, levels 1, 2, and 3 (e.g., A1.4.1) in the UMLS semantic network are regarded as too general because the terms in the semantic types at such levels are too broad.

STEP 4: Extend the semantic types identified in STEP 3 by following the ISA relations; too general semantic types are again ignored. Through this process, all sub-semantic types of the semantic types from STEP 3 are added to the semantic type list. For example, because [Antibiotic] is a sub-semantic type of [Pharmacologic Substance], [Antibiotic] is added. The four semantic types from STEP 3 are extended to 15 types through this process, as shown in Table 7. These semantic types (ST_A_can_ext) are used as the category restriction on the semantic types of the A terms.

Table 7: Extended Semantic Types through Tracking ISA Relations

Drug Delivery Device; Indicator, Reagent, or Diagnostic Aid; Antibiotic; Biologically Active Substance; Pharmacologic Substance; Chemical Viewed Functionally; Immunologic Factor; Receptor; Biomedical or Dental Material; Therapeutic or Preventive Procedure; Vitamin; Hormone; Enzyme; Hazardous or Poisonous Substance; Neuroreactive Substance or Biogenic Amine

STEP 5: Because there must exist at least one relationship between the A terms and the B terms, Bio-SbKDS checks whether there is at least one relationship between ST_B_can (the semantic types for B concepts from STEP 2) and ST_A_can_ext (the semantic types for A concepts obtained in STEP 4). For example, there are no relationships for the three pairs in Table 8. First, for each semantic type for the B terms, Bio-SbKDS checks whether there exists at least one relationship with any of the semantic types of the A terms. If a semantic type for the B terms does not have any relationship with any of the semantic types of the A terms, that semantic type is dropped from the semantic type list of the B terms. After this process is done for the semantic types of the B terms, the same process is performed for the semantic types

of the A terms. These processes are called mutual qualification. During the mutual qualification procedure, Bio-SbKDS simultaneously checks whether the two semantic type sets (for the A terms and the B terms) pass the predefined relation filter between the A terms and the B terms. This filter is shown in Table 9. Table 10 shows the two semantic type sets for the B concepts and the A concepts that are automatically generated using only the initial relations and the relation filters.

Table 8: The Semantic Types That Have No Relationship

Semantic Types for B Concepts   Semantic Types for A Concepts
Invertebrate                    Neuroreactive Substance or Biogenic Amine
Geographic Area                 Neuroreactive Substance or Biogenic Amine
Organic Chemical                Drug Delivery Device

Table 9: The Relation Filter between A Concepts and B Concepts

interacts_with, produces, complicates

Table 10: The Semantic Types as Category Restrictions for B Concepts and A Concepts

Semantic Types as Category Restrictions for A Concepts:
Indicator, Reagent, or Diagnostic Aid; Antibiotic; Biologically Active Substance; Pharmacologic Substance; Chemical Viewed Functionally; Immunologic Factor; Receptor; Biomedical or Dental Material; Therapeutic or Preventive Procedure; Vitamin; Hormone; Enzyme; Hazardous or Poisonous Substance; Neuroreactive Substance or Biogenic Amine

Semantic Types as Category Restrictions for B Concepts:
Cell Function; Carbohydrate; Eicosanoid; Steroid; Mental or Behavioral Dysfunction; Element, Ion, or Isotope; Organophosphorus Compound; Congenital Abnormality; Amino Acid, Peptide, or Protein; Organism Function; Pathologic Function; Organ or Tissue Function; Chemical Viewed Structurally; Nucleic Acid, Nucleoside, or Nucleotide; Organic Chemical; Cell or Molecular Dysfunction; Inorganic Chemical; Acquired Abnormality; Molecular Function; Neoplastic Process; Mental Process; Genetic Function; Lipid; Experimental Model of Disease; Physiologic Function
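The mutual qualification of STEP 5 can be sketched as follows. The `has_relation` predicate and the toy relation table stand in for a lookup against the UMLS semantic network combined with the A-B relation filter:

```python
def mutually_qualify(st_b_can, st_a_can, has_relation):
    """Drop semantic types that have no counterpart on the other side.

    `has_relation(b_type, a_type)` is assumed to return True when the
    UMLS semantic network links the pair with a relation that also
    passes the predefined A-B relation filter.
    """
    st_b = [b for b in st_b_can if any(has_relation(b, a) for a in st_a_can)]
    st_a = [a for a in st_a_can if any(has_relation(b, a) for b in st_b)]
    return st_b, st_a

# toy relation table standing in for the UMLS semantic network
relations = {("Pathologic Function", "Pharmacologic Substance")}
has_rel = lambda b, a: (b, a) in relations

st_b, st_a = mutually_qualify(
    ["Pathologic Function", "Geographic Area"],
    ["Pharmacologic Substance", "Drug Delivery Device"],
    has_rel,
)
print(st_b, st_a)  # Geographic Area and Drug Delivery Device are dropped
```

As in Table 8, types such as Geographic Area and Drug Delivery Device fall away because no qualifying relation links them to the other side.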

STEP 6: In order to collect B term candidates, the starting concept C is searched against Medline. Here, we should consider what the B terms should be. Because there should be meaningful semantic relationships between the B terms and the C term (for the B terms to be a bridge between the A terms and the C term), the B terms should be the major topics (concepts) of the documents retrieved by the keyword search against Medline. Therefore, we collect only MajorTopic MeSH terms from the downloaded documents and calculate their counts. The rationale for considering the counts of the B candidates here is that we try to find something (as A concepts) that is strongly associated with the C concept.

STEP 7: The B term category restrictions, which consist of the semantic types obtained in STEP 5, are applied to the MeSH terms extracted in STEP 6. Too general MeSH terms are also excluded. The top N terms are selected as B concepts (currently, N is 5). Table 11 shows the top 5 B terms based on their counts for the Raynaud Disease - Fish Oils case. Blood Viscosity is ranked first; it is the bridge concept Swanson found manually.

STEP 8: Unlike the initial search based on the starting concept C in STEP 6, Bio-SbKDS searches all top B terms against Medline. The B terms are ranked by their counts. In the search, the same date range is used as for the initial keyword; however, the documents relevant to the C concept should be excluded. Thus, the search keyword is of the form: B term AND Date_Range NOT C term. As in STEP 6, only MajorTopic MeSH terms are collected. A sample search keyword is the following:

"Blood Viscosity"[MAJOR] 1983[dp]:1985[dp] NOT "Raynaud Disease"[MeSH]
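The STEP 8 query template is purely mechanical string construction in PubMed query syntax; a minimal sketch:

```python
def b_term_query(b_term, start_year, end_year, c_term):
    """Build the STEP 8 Medline query: B term AND date range NOT C term."""
    return ('"{b}"[MAJOR] {s}[dp]:{e}[dp] NOT "{c}"[MeSH]'
            .format(b=b_term, s=start_year, e=end_year, c=c_term))

q = b_term_query("Blood Viscosity", 1983, 1985, "Raynaud Disease")
print(q)  # "Blood Viscosity"[MAJOR] 1983[dp]:1985[dp] NOT "Raynaud Disease"[MeSH]
```

This reproduces the sample search keyword shown above.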

Table 11: Top 5 Bridge Concepts with Their Counts

MajorTopic MeSH Terms   Count
Blood Viscosity          22
Quinazolines             10
Pyridines                 8
Vinyl Chloride            8
Imidazoles                8

STEP 9: The A term category restrictions, which consist of the semantic types obtained in STEP 5, are applied to the MeSH terms extracted in STEP 8. Too general MeSH terms are also excluded. In addition to these qualifications, the Bi-Decision Maker (discussed in Section 3.4.3) determines whether the MeSH terms are appropriate as A concepts. Through these processes, the A concept candidates are generated.

STEP 10: Because we seek only novel C-A relationships, the system eliminates A candidates that already have some relationship with the C concept by searching Medline; if the C and A concepts co-occur in the biomedical literature, those A concepts are dropped from the candidate list. From the A candidates, the top N_a A concepts are selected based on the weights inherited from the B term searches.

MeSH Term Qualification

Many MeSH terms are too general, and those terms may not be very useful for mining the undiscovered public knowledge. In order to find these too general MeSH

terms, we analyzed all the Medline documents from 1994 to 2004 (more than 5.3 million documents) and calculated the count of every MeSH term in them. Figure 15 shows the counts of the MeSH terms assigned to Medline documents. The Y axis indicates the counts of the MeSH terms and the X axis the MeSH terms themselves. Only a small number of MeSH term labels are visible on the X axis, although all MeSH terms are plotted. MeSH terms are, of course, not used equally often to index documents; some of them are used extensively. The top 100 MeSH terms account for more than 38% of all MeSH term usages. Here, we treat the MeSH terms in the red oval as too general.

Figure 15: The Counts of MeSH Terms Assigned to MEDLINE Articles
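Identifying too general terms then reduces to a frequency cutoff over the term counts; the counts below are illustrative, not the actual Medline figures:

```python
from collections import Counter

def too_general_terms(term_counts, top_k=100):
    """Treat the `top_k` most frequently assigned MeSH terms as too general.

    `term_counts` maps each MeSH term to the number of Medline documents
    it indexes; in the thesis this is computed over 5.3 million documents.
    """
    counts = Counter(term_counts)
    return {term for term, _ in counts.most_common(top_k)}

# illustrative counts only
counts = {"Human": 4_100_000, "Female": 2_300_000, "Blood Viscosity": 6_000}
print(too_general_terms(counts, top_k=2))  # {'Human', 'Female'}
```

The resulting set is what the category-restriction steps exclude when filtering B and A candidates.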

Bi-Decision Maker

The biggest challenge in this method is how to reduce the large number of potential candidates for the B terms. Because a single B term may involve many A candidates, it is crucial to reduce the number of B terms. Although the semantic types derived from the initial relations can, as category restrictions, constrain the B and A terms, not every term in those semantic types is appropriate as a B or A concept. For example, if the starting concept is Raynaud disease, we expect the B concepts to be the symptoms of the disease, something that causes those symptoms, or something that directly causes the disease. Consequently, we expect the A concepts to be something that relieves the symptoms or inhibits the factors causing them. The relationship between B and A should be complementary to the relationship between B and C. In short, if C is a human disease, we expect the A concepts to be something positive for human beings while the B concepts are something negative. Therefore, using these properties of the B and A concepts, we can further qualify the B and A terms. In order to determine whether a MeSH term is positive or negative in the complementary semantic relationship pairs among the starting concept C, the intermediate concepts B, and the target concepts A, the definitions of the MeSH terms are analyzed. Currently, our method detects certain keywords that carry different weights (from -5 to 5); a negative weight means negative and a positive weight positive. For example, the B candidate Nifedipine, which is actually ranked first before the Bi-Decision qualification process, is dropped after the process because some terms in its definition (underlined and italic in Table 12) are positive terms. Blood Viscosity is decided to be negative because "morbidity" and "disorders" are negative terms.
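The definition-based polarity test can be sketched as keyword scoring. The keyword weights below are illustrative stand-ins for the -5 to 5 weights the method assigns, not the system's actual lexicon:

```python
def definition_polarity(definition, keyword_weights):
    """Score a MeSH term's definition as positive or negative.

    Sums the weights of known polarity keywords found in the definition;
    a positive total suggests the term is beneficial, a negative total
    harmful. The keyword weights are illustrative assumptions.
    """
    text = definition.lower()
    return sum(w for kw, w in keyword_weights.items() if kw in text)

weights = {"useful": 3, "vasodilator": 2, "morbidity": -4, "disorders": -3}
nifedipine = "A potent vasodilator agent ... a useful antianginal agent"
viscosity = "... can contribute to morbidity in patients suffering from disorders ..."
print(definition_polarity(nifedipine, weights))  # 5 (positive, so dropped as a B term)
print(definition_polarity(viscosity, weights))   # -7 (negative, so kept)
```

This mirrors the Nifedipine versus Blood Viscosity decision described above: the former's definition scores positive, the latter's negative.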

The Bi-Decision Maker does not always identify all MeSH terms using their definitions, because for around 6% of MeSH terms no definition is provided by NLM. Moreover, many MeSH terms fall between negative and positive.

Table 12: The Definitions of MeSH Terms

Nifedipine: A potent vasodilator agent with calcium antagonistic action. It is a useful antianginal agent that also lowers blood pressure.

Blood Viscosity: The internal resistance of the BLOOD to shear forces. The in vitro measure of whole blood viscosity is of limited clinical utility because it bears little relationship to the actual viscosity within the circulation, but an increase in the viscosity of circulating blood can contribute to morbidity in patients suffering from disorders such as SICKLE CELL ANEMIA and POLYCYTHEMIA.

Combinational Search Method

Because we consider the counts of the B candidates, rank the candidates based on these counts, and select the top N_b B candidates, we expect the B concepts to be strongly related to the C concept. Here, the system searches all combinations of these B terms against Medline. A weight, the sum of the counts of the elements of the combination, is assigned to each combination. The weights of the combinations are inherited by the MeSH terms that are collected by the combination search. The rationale for such searching is that if there is something (an A concept) that is related to more than one B concept, that A concept may have a stronger relationship with the C concept. In other words, if a substance (an A concept) can relieve or inhibit more than one symptom, the substance should be regarded

as important (or ranked high). Table 13 shows the 7 combinational search keywords and their weights for the Raynaud Disease - Fish Oils case; the number of B terms is 3. (Each combined weight is the sum of the counts of the combination's B terms.)

Table 13: The Combination Search Keywords and Their Weights

Combination Search Keywords                                                                                          Combined Weight
"Blood Viscosity"[MAJOR] "Quinazolines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                         33
"Blood+Viscosity"[MAJOR] "Quinazolines"[MAJOR] "Piperidines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]    41
"Blood+Viscosity"[MAJOR] "Piperidines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                          31
"Quinazolines"[MAJOR] "Piperidines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                             18
"Blood Viscosity"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                                               23
"Quinazolines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                                                  10
"Piperidines"[MAJOR] 1974[dp]:1985[dp] NOT "Raynaud+Disease"[MeSH]                                                    8
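The combination weights follow directly from summing the counts of the B terms in each non-empty combination; a sketch, with the single-term counts taken from the bottom rows of Table 13:

```python
from itertools import combinations

def combination_weights(b_term_counts):
    """Weight every non-empty combination of B terms by the sum of
    the counts of its members, as in the combinational search."""
    terms = list(b_term_counts)
    result = {}
    for r in range(1, len(terms) + 1):
        for combo in combinations(terms, r):
            result[combo] = sum(b_term_counts[t] for t in combo)
    return result

# single-term counts from the bottom rows of Table 13
counts = {"Blood Viscosity": 23, "Quinazolines": 10, "Piperidines": 8}
w = combination_weights(counts)
print(len(w))                              # 7 combinations for 3 B terms
print(w[("Quinazolines", "Piperidines")])  # 18
```

With 3 B terms there are 2^3 - 1 = 7 combinations, matching the 7 search keywords in Table 13.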

Document Clustering using Bipartite Graph Representation (COBRA)

In this section, we present our novel clustering method, which integrates a bipartite graph representation of documents with a mutual refinement strategy. We call our method COBRA (Clustering Ontology-enriched Bipartite Graph Representation with Mutual Refinement Strategy). COBRA consists of the following three main steps: (1) representing the documents as a bipartite graph between the documents and the co-occurrence concepts in the documents, (2) initial clustering by grouping co-occurrence concepts, and (3) applying the mutual refinement strategy to the initial clustering results.

Bipartite Graphical Representation for Documents through Concept Mapping

The first step of all document clustering methods is to convert documents into a proper format. We view documents as sets of concepts that have complex internal semantic relationships. We assume that documents can be clustered based on the significant semantic features (i.e., co-occurrence concepts) in the documents. Therefore, we represent a set of documents as a bipartite graph to disclose the relationships between the documents and the co-occurrence concepts among the documents. The complete procedure of constructing a bipartite graph from a set of documents requires the following three steps: (1) concept mapping in the documents, (2) selection of corpus-level co-occurrence concepts as significant semantic features of the documents, and (3) construction of a bipartite graph representation with the significant semantic features. First, the concept mapping matches the terms in each document to the Entry terms in MeSH and then maps the selected Entry terms to MeSH Descriptors. We

now explain the process. Instead of searching all MeSH Entry terms against each document, we select 1- to 3-gram words as candidate MeSH Entry terms after removing all stop words from each document, and we keep only those candidates that match MeSH Entry terms. We then replace semantically similar Entry terms with their Descriptor term to remove synonyms. Next, we filter out MeSH Descriptors that are too general (e.g., HUMAN, WOMEN or MEN) or too common in MEDLINE articles (e.g., ENGLISH ABSTRACT or DOUBLE-BLIND METHOD); see Section for details. We assume that such terms have no distinguishing power for clustering documents. As a result, we obtain a set of only meaningful corpus-level concepts, in the form of MeSH Descriptors, representing the documents. We call this set the Document Concept Set (DCS), where DCS = {C_1, C_2, ..., C_n} and C_i is a corpus-level concept. In the second step, the significant semantic features to be used as a basis for clustering are generated from the set of documents. These significant semantic features indicate the semantic components, or intrinsic meanings, of the whole document collection. To extract them, we take advantage of term co-occurrence in documents. Co-occurring terms have long been used in document retrieval systems to identify indexing terms during query expansion [Conrad and Utt, 1994] [Jenssen et al, 2001]; in the biomedical domain, for example, co-occurrence has been used to capture potential relationships between genes, proteins and drugs in the literature [Wren, 2004]. In the same spirit, we use co-occurrence concepts as significant biomedical semantic features in the biomedical literature; such features have been regarded as more important than single terms [Hristovski et al, 2001] [Jenssen et al, 2001]

[Perez-Iratxeta, 2002]. Given the DCS, we define a co-occurrence concept CC = {C_i, C_j}, where C_i and C_j are two corpus-level concepts in the DCS. The set of co-occurrence concepts for a document set V_D is represented by V_CC = {CC_1, CC_2, CC_3, ..., CC_m}, where m is the number of corpus-level co-occurrence concepts for V_D. In order to select co-occurrence concepts from the many candidate concept pairs, the Mutual Information [Fano, 1961] is employed. In information theory, the Mutual Information of two random variables x and y indicates their mutual dependence, comparing the joint probability of x and y with the product of their individual probabilities. A higher Mutual Information between x and y means that x (or y) is non-randomly associated with y (or x). Accordingly, Mutual Information has been widely used to identify lexical dependencies [Church and Hanks, 1989], e.g., in finding functional genomic clusters in RNA expression data [Butte and Kohane, 2000] and in extracting features from large text databases [Conrad and Utt, 1994] [Wren, 2004]. The Mutual Information is defined as follows:

    Mutual Information(x, y) = log2 [ P(x, y) / (P(x) P(y)) ] ≈ log2 [ f(x, y) / (f(x) f(y)) ]    (1)

Here, f(x, y) is the co-occurrence count, defined as the number of documents that contain both concepts x and y. Because the Mutual Information may be unstable when f(x, y) is very small, we consider only concept pairs with f(x, y) > 0.05*N, where N is the number of documents [Church and Hanks, 1989]. Co-occurrence concepts are mirrored as edges on the graph, and their co-occurrence counts are used as edge weights.
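The selection step around Equation (1) can be sketched as follows. This is an illustrative implementation, not the evaluated system: the f(x, y) > 0.05·N stability filter follows the text above, while keeping only pairs with positive Mutual Information (non-random association) is an assumption made here for concreteness.

```python
import math
from itertools import combinations

def cooccurrence_concepts(doc_concepts, min_frac=0.05):
    """Select corpus-level co-occurrence concepts by Mutual Information (Eq. 1).
    doc_concepts: one set of MeSH Descriptor concepts per document."""
    n_docs = len(doc_concepts)
    df, pair_df = {}, {}                      # document frequencies
    for concepts in doc_concepts:
        for c in concepts:
            df[c] = df.get(c, 0) + 1
        for pair in combinations(sorted(concepts), 2):
            pair_df[pair] = pair_df.get(pair, 0) + 1
    selected = {}
    for (x, y), fxy in pair_df.items():
        if fxy <= min_frac * n_docs:          # MI is unstable for rare pairs
            continue
        # MI(x, y) = log2( P(x,y) / (P(x) P(y)) ), probabilities estimated
        # from document frequencies
        mi = math.log2((fxy / n_docs) / ((df[x] / n_docs) * (df[y] / n_docs)))
        if mi > 0:                            # non-random association only
            selected[(x, y)] = fxy            # co-occurrence count = edge weight
    return selected

docs = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"C", "D"}]
cc = cooccurrence_concepts(docs)
```

On this toy corpus, the frequent pair (A, B) and the dependent pair (C, D) survive, while (A, C) and (B, C) are rejected because their joint frequency is below what independence would predict.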

In addition, the use of co-occurrence concepts prevents noise concepts, which are unrelated to the topic of a document but nevertheless appear in it (e.g., Cancer in this paper), from affecting the similarity measurement during document clustering. For example, suppose a document D_x that belongs to a document cluster (say DC1) has the concepts {C_1, C_2, C_3, C_4}. C_4, however, is not relevant to the topic of D_x (just as this paper contains many cancer terms), while C_4 is a very important concept in another document cluster (say DC2). In that case, traditional document clustering approaches may assign D_x to the wrong document cluster (i.e., DC2), because D_x contains C_4, an important concept of DC2. If we consider co-occurrence concepts instead, the concept pairs involving C_4 (as co-occurrence term candidates) would not become co-occurrence concepts, due to their very low frequencies over the documents. Thus, the irrelevant concept C_4 is effectively removed from the term set of D_x and does not pull D_x toward the irrelevant document cluster DC2. In the third step, we construct a bipartite graph. A bipartite graph G for a given set V_D of n documents and a set V_CC of corpus-level co-occurrence concepts is represented as G = (V_D + V_CC, E), where E indicates the relationships between V_D and V_CC. Weights can optionally be specified on the edges; in that case one must provide a sophisticated weighting scheme to measure the contribution of concepts to each document. However, such a weighting scheme may not be appropriate for small documents, such as MEDLINE abstracts, and it requires |V_D| * |V_CC| complexity. Thus, we draw an unweighted bipartite graph. Figure 16 illustrates this third step. In this figure, a set of documents is represented as a bipartite graph. The italic letters (Q & W) indicate that they are irrelevant

concepts in their documents; they are excluded from the graph during co-occurrence concept detection due to their very low frequencies.

Figure 16: A Sample Bipartite Graph between Documents and Corpus-level Co-occurrence Concepts

So far, we have presented the process of constructing a bipartite graph from a set of documents. The primary advantage of the bipartite graphical representation of documents is that each document is semantically associated with the corpus-level co-occurrence concepts as significant semantic features, as shown in Figure 16. The representation thus provides an explanation for document categorization in a two-dimensional space. This property contrasts with the vector space representation, which uses an N-dimensional space based on all the selected terms. Thus, visualization or the

explanation for document categorization is much easier in our representation than in the vector space model.

Initial Clustering by Combining Co-Occurrence Concepts

Here, COBRA generates initial clusters by combining co-occurrence concepts. Since similar documents share the same or semantically similar co-occurrence concepts, COBRA combines co-occurrence concepts and then clusters documents based on their similarities to the k co-occurrence concept groups. In combining them, there are two ways to measure the similarity between co-occurrence concepts: their semantic similarity within the MeSH concept hierarchy (sim_cc) and their document coverage similarity (sim_doc), shown in Equations (2) and (3), respectively. The semantic similarity between two co-occurrence concepts CC_i and CC_j in the concept hierarchy (sim_cc) is the average similarity of the four concept pairs (i.e., the product of CC_i and CC_j). The C^p in Equation 2 indicates the set of parent concepts of concept C in the concept hierarchy. The document coverage similarity (sim_doc) is the overlap rate of the document coverage zones of CC_i and CC_j. This similarity is based on the information-theoretic measure of [Lin, 1998]. Formally, it is defined as the ratio between the amount of information needed to state the commonality of the co-occurrence concepts and the information needed to fully describe what the co-occurrence concepts are, in terms of the number of relevant documents.

    sim_cc(CC_i, CC_j) = [ Σ_{C_i ∈ CC_i, C_j ∈ CC_j} |C_i^p ∩ C_j^p| / (|C_i^p| + |C_j^p|) ] / (|CC_i| · |CC_j|)    (2)

    sim_doc(CC_i, CC_j) = |docs_CC_i ∩ docs_CC_j| / |docs_CC_i ∪ docs_CC_j|    (3)

Here, docs_CC_i denotes the set of documents that contain the co-occurrence concept CC_i. We integrate the two measures with weights into the Co-occurrence Concept Similarity (CCS). Given two co-occurrence concepts CC_i and CC_j, the CCS is defined in Equation (4) as follows (λ = 0.5 in the experiments):

    sim(CC_i, CC_j) = λ · sim_cc(CC_i, CC_j) + (1 − λ) · sim_doc(CC_i, CC_j),    (4)

with λ ∈ [0, 1] as the weight. Based on the average-link clustering algorithm using this integrated similarity function, COBRA combines co-occurrence concepts until k co-occurrence concept groups remain. The cost function of the average-link clustering algorithm is shown in Equation (5):

    ( 1 / (|S_i| · |S_j|) ) Σ_{CC_i ∈ S_i, CC_j ∈ S_j} sim(CC_i, CC_j)    (5)

Here, S_i is a co-occurrence concept cluster. For the initial document clusters, COBRA links each document to the k co-occurrence concept clusters based on its similarity to them. This similarity is simply measured by the number of times the co-occurrence concepts in each document appear in each of the k clusters; a document is assigned to the most similar co-occurrence concept cluster. For

example, suppose there are two co-occurrence concept clusters, S_1 = {CC_1, CC_2, CC_3} and S_2 = {CC_4, CC_5}, and a document has CC_2, CC_3, and CC_5. Then the document is assigned to S_1. Figure 17 shows the pseudo-code of the initial clustering algorithm.

Algorithm: Initial Clustering
Input: k, V_CC, V_D
Output: V_DC, k groups of V_CC

Place each object CC_i of V_CC into S_i as its own cluster, creating the list of clusters L = {S_1, S_2, ..., S_m}
For |L| - k times
    Find the most similar pair of clusters S_i and S_j (i.e., find the maximum of Eq. (5))
    Merge S_i and S_j into S_ij and remove them from L
End For
/* Now, L has k co-occurrence concept clusters */
Assign each document to the cluster that has the largest number of its co-occurrences.

Figure 17: The Initial Clustering Algorithm
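The procedure of Figure 17 can be sketched in Python as follows. This is an illustrative version, not the evaluated implementation: `sim` stands in for the integrated similarity of Equation (4), and the quadratic search for the best pair is kept naive for clarity.

```python
def initial_clustering(k, ccs, docs, sim):
    """COBRA initial clustering sketch: average-link merging of co-occurrence
    concepts into k groups (Eq. 5), then assigning each document to the group
    sharing the most of its co-occurrence concepts.
    ccs:  list of co-occurrence concepts
    docs: list of sets, the co-occurrence concepts of each document
    sim:  pairwise similarity over co-occurrence concepts (Eq. 4)"""
    clusters = [{cc} for cc in ccs]           # each concept starts alone
    while len(clusters) > k:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):        # find the pair with the highest
            for j in range(i + 1, len(clusters)):  # average-link similarity
                s = sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                s /= len(clusters[i]) * len(clusters[j])
                if s > best:
                    best, best_pair = s, (i, j)
        i, j = best_pair
        clusters[i] |= clusters[j]            # merge the best pair
        del clusters[j]
    # assign each document to its most similar co-occurrence concept group
    labels = [max(range(k), key=lambda g: len(d & clusters[g])) for d in docs]
    return clusters, labels

# toy example: CC1-CC3 are mutually similar, and so are CC4-CC5
ccs = ["CC1", "CC2", "CC3", "CC4", "CC5"]
group_a = {"CC1", "CC2", "CC3"}
sim = lambda a, b: 1.0 if (a in group_a) == (b in group_a) else 0.0
docs = [{"CC2", "CC3", "CC5"}, {"CC4", "CC5"}]
groups, labels = initial_clustering(2, ccs, docs, sim)
```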

Mutual Refinement Strategy for Document Clustering

Through the procedures discussed above, COBRA generates initial clusters. However, as in hierarchical clustering methods, this clustering cannot correct erroneous decisions: once the clustering procedure has run, the results are never refined further, even though the procedure is based on local optimization. In our method, COBRA purifies the initial document clusters by mutually refining the k co-occurrence concept groups and the k document clusters. The basic idea of the mutual refinement strategy for document clustering is as follows: a co-occurrence concept should be linked to the document cluster to which it makes the best contribution, and a document cluster should be related to the co-occurrence concepts that make significant contributions to it. For this mutual refinement strategy we draw another bipartite graph, from the k document clusters to the set of co-occurrence concepts. Given the graph G = (V_DC + V_CC, E), V_DC is the set of k document clusters (V_DC = {DC_1, DC_2, ..., DC_k}), V_CC is the set of co-occurrence concepts (V_CC = {CC_1, CC_2, CC_3, ..., CC_m}), and E is the set of relationships between V_DC and V_CC. We specify weights on the edges so that we can measure the contribution of co-occurrence concepts to each document cluster. This contribution is defined as the ratio between the amount of information needed to state the co-occurrence concept in a document cluster and the total information in the document cluster, in terms of the number of documents. The above statement is mathematically rendered as

    cntrb(CC_i, DC_k) = Size(docs_CC_i^DC_k) / Size(DC_k)    (6)

Here, the Size function returns the number of relevant documents, and docs_CC_i^DC_k represents the set of documents with co-occurrence concept CC_i in the document cluster DC_k. After each refinement, using the k new co-occurrence concept groups, each document is reassigned to the proper document cluster in the same way used for generating the initial clusters. This mutual refinement iteration continues until no further changes occur in the document clusters. Figure 18 shows the pseudo-code of the mutual refinement strategy algorithm.

Algorithm: Mutual Refinement Strategy
Input: k, Documents, V_CC, V_DC
Output: k V_CC groups, V_DC

Do // iteration
    For Each CC_i ∈ V_CC
        Assign CC_i to the document cluster (DC) to which it makes the most contribution (i.e., the maximum of Eq. (6))
    End For
    /* Note: the For loop polishes the k co-occurrence concept clusters */
    Assign each document to the cluster that has the largest number of its co-occurrences.
Loop Until NoChangeInDocumentClusters

Figure 18: The Mutual Refinement Strategy Algorithm
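The refinement loop of Figure 18 can be sketched as follows, with the contribution of Equation (6) computed as the fraction of a cluster's documents containing the concept. Names and data structures are illustrative, not the evaluated implementation.

```python
def mutual_refinement(cc_groups, doc_clusters, doc_ccs):
    """COBRA mutual refinement sketch. Each co-occurrence concept moves to the
    document cluster where its contribution (Eq. 6) is highest; each document
    is then reassigned to the concept group covering most of its concepts.
    cc_groups:    list of k sets of co-occurrence concepts
    doc_clusters: list of k sets of document ids
    doc_ccs:      dict mapping doc id -> set of its co-occurrence concepts"""
    k = len(cc_groups)
    while True:
        # refine concept groups: contribution = fraction of the cluster's
        # documents that contain the concept
        all_ccs = set().union(*cc_groups)
        new_groups = [set() for _ in range(k)]
        for cc in all_ccs:
            contrib = [
                sum(1 for d in doc_clusters[g] if cc in doc_ccs[d])
                / max(len(doc_clusters[g]), 1)
                for g in range(k)
            ]
            new_groups[contrib.index(max(contrib))].add(cc)
        cc_groups = new_groups
        # reassign documents to the group covering most of their concepts
        new_clusters = [set() for _ in range(k)]
        for d, d_ccs in doc_ccs.items():
            best = max(range(k), key=lambda g: len(d_ccs & cc_groups[g]))
            new_clusters[best].add(d)
        if new_clusters == doc_clusters:      # converged: no document moved
            return cc_groups, doc_clusters
        doc_clusters = new_clusters

# toy example: document 4 starts in the wrong cluster and is pulled back
doc_ccs = {1: {"A", "B"}, 2: {"A"}, 3: {"C", "D"}, 4: {"C"}}
groups, clusters = mutual_refinement(
    [{"A", "B"}, {"C", "D"}], [{1, 2, 4}, {3}], doc_ccs)
```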

CHAPTER 4: EXPERIMENTAL EVALUATION

In this chapter, document clustering, text summarization, and a semantic version of Swanson's ABC model are experimentally evaluated on MEDLINE articles.

4.1 Document Clustering using Scale-free Graphical Representation

Document Sets

In order to measure the effectiveness of Clustering with Ontology-enriched Graphical Representation for documents (COGR), we conducted extensive experiments on public MEDLINE abstracts. For these experiments, we first collected document sets related to diseases from MEDLINE, using the MajorTopic tag along with the disease MeSH terms as queries (see Section for the tag in detail). Table 14 shows the document sets used in our experiments. After retrieving the data sets, we generated various document combinations, whose numbers of classes range from 2 to 9, by randomly mixing the document sets in Table 14. The document sets used for generating the combinations are later used as answer keys in the performance measurement. We emphasize that our corpora are much bigger than those of other document clustering studies such as [Beil et al, 2002] [Larsen and Aone, 1999] [Li et al, 2004] [Pantel and Lin, 2002] [Steinbach, 2000] [Zhao and Karypis, 2002] [Zhong and Ghosh, 2003] [zu Eissen et al, 2005]; these studies used, respectively, at most 8.3k, 20k, 8.6k, 19k, 3k, 11k, 8.6k, and 1k documents in their experiments.

Table 14: The Document Sets and Their Sizes

  Document Set                        ID     No. of Docs
  Gout                                Gt     642
  Chickenpox                          Chk    1,083
  Raynaud Disease                     RD     1,153
  Insomnia                            Ins    1,352
  Jaundice                            Jn     1,486
  Hepatitis B                         Hpt    1,815
  Hay Fever                           HF     2,632
  Kidney Calculi                      KS     3,071
  Impotence                           Imp    3,092
  Age-related Macular Degeneration    AMD    3,277
  Migraine                            Mg     4,174
  Otitis                              Ot     5,233
  Osteoporosis                        Ost    8,754
  Osteoarthritis                      OA     8,987
  Parkinson Disease                   Pk     9,933
  Alzheimer Disease                   Alz    18,033
  Diabetes Type 2                     Diab   18,726
  AIDS                                AIDS   19,671
  Depressive Disorder                 Dep    19,926
  Prostatic Neoplasm                  Pros   23,639
  Coronary Heart Disease              CHD    53,664
  Breast Neoplasm                     Bre    56,075

Table 15: List of Test Corpora Generated from the Base Data Sets

  Corpus Name                              Corpus ID    Corpus Size
  2_Mg-Alz                                 C2.1         22k
  2_Ot-AMD                                 C2.2         9k
  2_Bre-CHD                                C2.3         110k
  3_OA-Ost-Pk                              C3.1         28k
  3_AMD-Mg-Ot                              C3.2         13k
  3_Pros-Bre-CHD                           C3.3         132k
  4_Alz-AMD-Ot-Ost                         C4.1         35k
  4_Dep-AIDS-Alz-Diab                      C4.2         76k
  4_Ost-AMD-Mg-Ot                          C4.3         21k
  5_AIDS-Alz-AMD-Ot-Ost                    C5.1         55k
  5_Alz-AMD-Mg-Ost-Ot                      C5.2         39k
  5_HF-KS-Imp-AMD-Mg                       C5.3         16k
  6_AMD-Mg-Ot-OA-Ost-Pk                    C6.1         40k
  6_Ins-Jn-Hpt-HF-KS-Imp                   C6.2         13k
  6_Pros-Ost-Alz-AIDS-Dep-Diab             C6.3         109k
  7_Jn-Hpt-HF-KS-Imp-AMD-Mg                C7.1         20k
  7_Chk-Jd-Hpt-HF-KS-AMD-Mg                C7.2         18k
  7_Ost-Pk-Alz-AIDS-Dep-Diab-Pros          C7.3         119k
  8_KS-Imp-Gt-Chk-RD-Ins-Jn-Hpt            C8.1         14k
  8_Mg-Gt-Chk-Jn-Hpt-HF-KS-AMD             C8.2         18k
  8_OA-Ost-Pk-Alz-AIDS-Dep-Diab-Pros       C8.3         128k
  9_Mg-Gt-Chk-Rd-Jn-Hpt-HF-KS-AMD          C9.1         19k
  9_Mg-Chk-Ins-Jn-Hpt-HF-KS-Imp-AMD        C9.2         22k
  9_Ot-OA-Ost-Pk-Alz-AIDS-Dep-Diab-Pros    C9.3         133k

Each corpus name in Table 15 indicates the number of document sets (i.e., k) used for the corpus generation and which document sets were used (the document set IDs of Table 14 are delimited by "-"). The format of a corpus ID is Ck.n, where k is the number of document sets and n is a sequence number distinguishing different combinations.
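The corpus generation described above (mixing base document sets while keeping each document's origin as the answer key) can be sketched as follows; the helper name and data layout are illustrative.

```python
import random

def make_corpus(doc_sets, ids):
    """Build a labeled test corpus by mixing k base document sets; the source
    set of each document is kept as its answer key for evaluation.
    doc_sets: dict mapping set ID -> list of documents
    ids:      the set IDs to mix, e.g. ["AMD", "Mg", "Ot"]"""
    corpus = [(doc, set_id) for set_id in ids for doc in doc_sets[set_id]]
    random.shuffle(corpus)                    # destroy the retrieval order
    name = f"{len(ids)}_" + "-".join(ids)     # naming scheme of Table 15
    return name, corpus

doc_sets = {"AMD": ["d1", "d2"], "Mg": ["d3"], "Ot": ["d4", "d5"]}
name, corpus = make_corpus(doc_sets, ["AMD", "Mg", "Ot"])
```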

Evaluation Method

In general, clustering systems have been evaluated in three ways. First, document clustering systems can be assessed through user studies whose main purpose is to measure users' satisfaction with the output of the systems. This kind of evaluation has been widely used especially by the IR community, because that community carries out goal-oriented investigations; for instance, it can demonstrate the effectiveness of clustering search engine results to support information access tasks on the web [Zamir and Etzioni, 1998]. Second, the objective functions of clustering algorithms have been used to evaluate the algorithms; for example, the sum-squared error that K-means minimizes over all objects can be used for clustering evaluation. This method is normally used when the classes are unknown or the balance of a test corpus is very low. (The balance of a corpus is the ratio of the number of documents in the smallest document class to the number of documents in the largest document class.) Finally, clustering algorithms can be evaluated by comparing clustering output with known classes as answer keys. A number of comparison metrics exist, such as the mutual information metric [Xu and Gong, 2004], misclassification index (MI) [Zeng et al, 2002], purity [Zhao and Karypis, 2002], confusion matrix [Aggarwal et al, 1999], F-measure [Larsen and Aone, 1999], and Entropy (see [Ghosh, 2003] for more examples). In our experiments we use misclassification index (MI), F-measure, and cluster purity as clustering evaluation metrics. MI is the ratio of the number of misclassified objects to the size of the whole data set [Zeng et al, 2002]; thus, an MI of 0% means perfect clustering. For example, MI is

calculated as follows for the situation shown in Table 16. Note that the total number of objects in the classes is the same as the number of objects in the clusters.

    MI = (# of misclassified objects) / (total # of objects) = 3 / 100 = 3%

Table 16: Sample Classes and Clustering Output (each number in the table is the number of objects in its class or cluster)
  Cluster 1: no misclassified objects
  Cluster 2: 3 objects misclassified
  Cluster 3: no misclassified objects

F-measure is a measure that combines recall and precision from information retrieval. When F-measure is used as a clustering quality measure, each cluster is treated as the retrieved documents for a query and each class is regarded as an ideal query result. Larsen and Aone [Larsen and Aone, 1999] defined the overall clustering F-measure as the weighted average of the per-class F-measure values, as given by the following: for class i and cluster j,

    F = Σ_i (n_i / n) · max_j { F(i, j) },

where the max function is over all clusters, n is the number of documents, and

    F(i, j) = 2 · Recall(i, j) · Precision(i, j) / ( Recall(i, j) + Precision(i, j) )

However, this formula is sometimes problematic: if one cluster contains the majority (or even all) of the objects, more than one class may be matched with that single cluster when calculating F-measure, and some clusters are matched with no class at all (meaning those clusters are not evaluated by the F-measure). Thus, we exclude already-matched clusters from the max function; in consequence, each class is matched with exactly one cluster, the one that yields the maximum F-measure. The cluster purity indicates the percentage of the dominant class's members in a given cluster; this percentage is simply the maximum precision over the classes. To measure the overall clustering purity, we use the weighted average purity shown below (for class i and cluster j). As with F-measure, we eliminate already-matched clusters during the max computation.

    Purity = Σ_j (n_j / n) · max_i { Precision(i, j) },    where n is the number of documents

Note that a smaller MI implies better clustering quality, while bigger F-measure and purity values indicate better clustering quality.
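The three metrics can be sketched as follows. For brevity, this illustrative version omits the exclusion of already-matched clusters described above, and it computes MI from each cluster's dominant class rather than from an explicit class-to-cluster assignment.

```python
from collections import Counter

def evaluate(labels_true, labels_pred):
    """Sketch of MI, overall F-measure (Larsen & Aone's weighted form), and
    weighted purity, computed from flat class/cluster label lists."""
    n = len(labels_true)
    cont = Counter(zip(labels_true, labels_pred))   # (class, cluster) counts
    classes = sorted({t for t, _ in cont})
    clusters = sorted({p for _, p in cont})
    n_class = Counter(labels_true)                  # class sizes n_i
    n_clust = Counter(labels_pred)                  # cluster sizes n_j

    def f(i, j):                                    # F(i, j) from Eq. above
        nij = cont[(i, j)]
        if nij == 0:
            return 0.0
        recall, precision = nij / n_class[i], nij / n_clust[j]
        return 2 * recall * precision / (recall + precision)

    f_measure = sum(n_class[i] / n * max(f(i, j) for j in clusters)
                    for i in classes)
    purity = sum(n_clust[j] / n * max(cont[(i, j)] / n_clust[j]
                                      for i in classes)
                 for j in clusters)
    # MI: objects outside their cluster's dominant class are misclassified
    correct = sum(max(cont[(i, j)] for i in classes) for j in clusters)
    mi = 1 - correct / n
    return mi, f_measure, purity

mi, fm, pur = evaluate([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```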

Experimental Setting

In order to evaluate COGR, we compare its effectiveness with that of a leading document clustering approach, BiSecting K-means, as well as traditional K-means, hierarchical clustering algorithms (single-link, complete-link, and average-link), and Suffix Tree Clustering (STC). Two recent document clustering studies showed that BiSecting K-means outperforms both hierarchical clustering methods and K-means on various document sets from TREC, Reuters, WebACE, etc. [Steinbach et al, 2000] [Beil et al, 2002]. A recent comparative study showed that CLUTO's vcluster [Karypis, 2003] (with default options) outperforms several model-based document clustering algorithms [Zhong and Ghosh, 2003]. This clustering program is an implementation of BiSecting K-means optimized for large datasets. In addition, the program uses an optimized cluster selection method and a special criterion function that produced the best overall clustering results in a comparison of a total of seven different clustering criterion functions [Zhao and Karypis, 2002]. According to our previous clustering studies [Yoo and Hu, 2006a] [Yoo and Hu, 2006b] [Yoo and Hu, 2006c], CLUTO's vcluster is significantly superior to the original implementation of BiSecting K-means [Steinbach et al, 2000] in terms of clustering quality and scalability. We provide all the clustering algorithms except STC and COGR with word-by-document matrices (i.e., vector representations) as input, generated by the doc2mat Perl script [Karypis, 2003]. For STC, we input both a word string and a concept string (we detected MeSH Entry terms in each string and replaced them with MeSH Descriptors).

The implementation of STC is based on [Zamir and Etzioni, 1998]. We use the BiSecting K-means, K-means, and hierarchical clustering algorithms in the CLUTO clustering package [Karypis, 2003]. Because BiSecting K-means and K-means may produce different results on every run due to their random initializations, we ran them five times.

Experiment Results

Because the full detailed experiment results are too large to be depicted here, we average the clustering evaluation metric values and report the standard deviations (σ), which indicate how consistently a clustering approach yields document clusters (simply, the reliability of each approach). The σ is a very important document clustering evaluation factor, because document clustering is performed in circumstances where information about the documents is unknown. Figure 19 shows the comparison of MI, Purity, and F-measure for COGR and six traditional approaches, excluding the hierarchical algorithms. Table 17 summarizes the statistical information about the clustering results. From the figure and the table, we make the following observations:

- COGR outperforms the nine document clustering methods.
- COGR has the most stable clustering performance regardless of the test corpora, while CLUTO BiSecting K-means and K-means do not always show stable clustering performance.
- Hierarchical approaches have a serious scalability problem.
- STC and the original BiSecting K-means have a scalability problem.

- The MeSH Ontology improves the clustering solutions of STC.
- Unexpectedly, the original BiSecting K-means [Steinbach et al, 2000] shows poor performance.

Unlike the studies [Steinbach et al, 2000] and [Beil et al, 2002], our experiment results indicate that the original BiSecting K-means is even worse than K-means. On the other hand, a similar result is also found in [Pantel and Lin, 2002]. This contradiction leads us to conclude that the clustering results of BiSecting K-means and K-means depend heavily on the document sets used.

Table 17: Summary of Overall Experiment Results on MEDLINE Document Sets. For each method, the mean (μ) and standard deviation (σ) of MI, Purity, and F-measure over the test corpora are reported. The methods compared are STC (word strings and concept strings), K-means, CLUTO BiSecting K-means, the original BiSecting K-means (Largest and LOS variants), and COGR.
  LOS: selecting the cluster to be bisected with the least overall similarity; Largest: selecting the largest cluster to be bisected.
  MI: the smaller, the better clustering quality. Purity and F-measure: the bigger, the better clustering quality.

We discern that CLUTO BiSecting K-means and K-means usually perform very decently. However, on some document collections (C2.1, C4.1, C5.2, and C5.1 in Figure 19) they commonly performed poorly. We believe this is because non-domain-specific words in those document collections significantly hinder the similarity detection between documents, since it is for those collections that COGR most greatly improves the clustering results compared to the other document collections.

Figure 19: Comparison of MIs for COGR and Traditional Document Approaches (MI: the smaller, the better). The figure plots MI against corpus size and ID for [W]STC, [C]STC, the original BiSecting K-means, COGR, BiSecting K-means (largest), BiSecting K-means (LOS), K-means, and the hierarchical algorithms (average-, single-, and complete-link) over the 24 test corpora, ordered by size from 9k (C2.2) to 133k (C9.3).

We observe that COGR has the best performance, yields the most stable clustering results, and scales very well. More specifically, COGR shows a 45% cluster quality improvement and a 72% clustering reliability improvement, in terms of MI, over BiSecting K-means with the best parameters. Three reasons support these results. First, COGR uses an ontology-enriched graphical representation that retains the semantic relationship information about the core concepts of the documents. Second, COGR uses document cluster models that capture the core semantic relationships of each document cluster to categorize documents. Third, as the number of documents to be processed increases, a corpus-level graphical representation at most linearly expands, or keeps its size with only some changes in edge weights, while a vector space representation (i.e., a document-by-word matrix) at least linearly grows, increasing by n*t, where n is the number of documents and t is the number of distinct terms in the documents. In addition to the superiority of COGR over traditional document clustering approaches, one should note that only COGR supplies a meaningful explanation for the document clustering, as well as summaries of each document cluster, through the generated document cluster models. This can be critical for users trying to understand the clustering results and the documents as a whole, because document clustering is performed in circumstances where information about the documents is unknown.

Text Summarization

Because the document clustering and text summarization are merged into a coherent system, the text summarization experiment uses the same experimental data as the document clustering. Tables 18, 19 and 20 show the experiment results for the document clusters called Alzheimer Disease, Parkinson Disease, and Osteoarthritis. We believe that the document cluster models in HVS and the Top 7 sentences chosen as summaries significantly help users understand each document cluster.

Table 18: Experiment Results for Text Summarization. For the Alzheimer Disease document cluster, its document cluster model (HVS sets) and key sentences as summary are shown.

Top 7 Sentences as Summary for the Document Cluster:
1. Tau protein extracted from filaments of familial multiple system tauopathy with presenile dementia shows a minor 72-kDa band and two major bands of 64 and 68 kDa that contain mainly hyperphosphorylated four-repeat tau isoforms of 383 and 412 amino acids.
2. The central pathological cause of Alzheimer disease (AD) is hypothesized to be an excess of beta-amyloid (Abeta) which accumulates into toxic fibrillar deposits within extracellular areas of the brain. These deposits disrupt neural and synaptic function and ultimately lead to neuronal degeneration and dementia.
3. In dementia of Alzheimer type (DAT), cerebral glucose metabolism is reduced in vivo, and enzymes involved in glucose breakdown are impaired in post-mortem brain tissue.
4. Alzheimer's disease (AD), a progressive, degenerative disorder of the brain, is believed to be the most common cause of dementia amongst the elderly.
5. The fundamental cause of Alzheimer dementia is proposed to be Alzheimer disease, i.e. the neurobiological abnormalities in the Alzheimer brain.
6. Alzheimer's disease (AD) is a degenerative disease of the brain, and the most common form of dementia.
7. Regional quantitative analysis of NFT in brains of non-demented elderly persons: comparisons with findings in brains of late-onset Alzheimer's disease and limbic NFT dementia.

Table 19: Experiment Results for Text Summarization. For the Parkinson Disease document cluster, its document cluster model (HVS sets) and key sentences as summary are shown.

Top 7 Sentences as Summary for the Document Cluster:
1. Because genetic defects relating to the ubiquitin-proteasome system were reported in familial parkinsonism, we evaluated proteasomal function in autopsied brains with sporadic Parkinson's disease.
2. The BRAIN TEST, a computerized alternating finger tapping test, was performed on 154 patients with parkinsonism to assess whether the test could be used as an objective tool to evaluate reliably the severity of Parkinson's disease (PD).
3. Brain tissue of 50 patients with morphologically confirmed Parkinson's disease (PD), blood samples from 149 patients with clinical parkinsonism and from 96 healthy control subjects were collected.
4. Parkinson's disease is one of the most frequent neurodegenerative brain diseases.
5. There is some evidence that Parkinson's disease (PD) seems to be a heterogeneous and generalized brain disorder reflecting a degeneration of multiple neuronal networks, including somatostatinergic neurons.
6. These findings suggest that many of these patients did not have Parkinson's disease but rather rigid-akinetic syndromes associated with degenerative brain disease.
7. The cutoff point that best distinguished patients with suspected vascular parkinsonism from patients with Parkinson's disease was a 0.6% level of lesioned brain tissue volume.

Table 20: Experiment Results for Text Summarization. For the Osteoarthritis document cluster, its document cluster model (HVS sets) and key sentences as summary are shown.

Top 7 Sentences as Summary for the Document Cluster:
1. Pathological joint events in both inflammatory arthritis and degenerative arthritis are perpetuated by complex cytokine interactions.
2. In 8, who had severe osteoarthritis, a bicompartmental ICLH (Imperial College-London Hospital) prosthesis was used; in 12, with moderate arthritis, the medial side of the joint was replaced by a unicompartmental Brigham prosthesis.
3. In old scaphoid fractures, the degenerative arthritis begins with an impingement between the radial styloid process and the proximal pole of the scaphoid, and then reaches the lunocapitate joint. A dorsiflexion instability is then constant.
4. These patients suffered from painful posttraumatic degenerative arthritis after tarsometatarsal joint fracture-dislocation.
5. More than 85% of all adult cadavers demonstrate degenerative arthritis of the radial subsesamoid joint.
6. OBJECTIVE: Osteoarthritis (OA) is the most common type of arthritis; involvement of joints in the hand is highly prevalent, especially in the elderly.
7. Dysfunction of the pisotriquetral joint: degenerative arthritis treated by excision of the pisiform.


More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

dr.ir. D. Hiemstra dr. P.E. van der Vet

dr.ir. D. Hiemstra dr. P.E. van der Vet dr.ir. D. Hiemstra dr. P.E. van der Vet Abstract Over the last 20 years genomics research has gained a lot of interest. Every year millions of articles are published and stored in databases. Researchers

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

A Coherent Graph-based Semantic Clustering and Summarization Approach for Biomedical Literature and a New Summarization Evaluation Method.

A Coherent Graph-based Semantic Clustering and Summarization Approach for Biomedical Literature and a New Summarization Evaluation Method. A Coherent Graph-based Semantic Clustering and Summarization Approach for Biomedical Literature and a New Summarization Evaluation Method. Illhoi Yoo 1, Xiaohua Hu 2, Il-Yeol Song 2 1 Department of Health

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

What is Text Mining? Sophia Ananiadou National Centre for Text Mining University of Manchester

What is Text Mining? Sophia Ananiadou National Centre for Text Mining   University of Manchester National Centre for Text Mining www.nactem.ac.uk University of Manchester Outline Aims of text mining Text Mining steps Text Mining uses Applications 2 Aims Extract and discover knowledge hidden in text

More information

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the relationships between concepts. And we discussed common

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

On Topic Categorization of PubMed Query Results

On Topic Categorization of PubMed Query Results On Topic Categorization of PubMed Query Results Andreas Kanavos 1, Christos Makris 1 and Evangelos Theodoridis 1,2 1.Computer Engineering and Informatics Department University of Patras Rio, Patras, Greece,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

MeSH: A Thesaurus for PubMed

MeSH: A Thesaurus for PubMed Resources and tools for bibliographic research MeSH: A Thesaurus for PubMed October 24, 2012 What is MeSH? Who uses MeSH? Why use MeSH? Searching by using the MeSH Database What is MeSH? Acronym for Medical

More information

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time: English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

More information

VISUAL RERANKING USING MULTIPLE SEARCH ENGINES

VISUAL RERANKING USING MULTIPLE SEARCH ENGINES VISUAL RERANKING USING MULTIPLE SEARCH ENGINES By Dennis Lim Thye Loon A REPORT SUBMITTED TO Universiti Tunku Abdul Rahman in partial fulfillment of the requirements for the degree of Faculty of Information

More information

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009 Maximizing the Value of STM Content through Semantic Enrichment Frank Stumpf December 1, 2009 What is Semantics and Semantic Processing? Content Knowledge Framework Technology Framework Search Text Images

More information

SciMiner User s Manual

SciMiner User s Manual SciMiner User s Manual Copyright 2008 Junguk Hur. All rights reserved. Bioinformatics Program University of Michigan Ann Arbor, MI 48109, USA Email: juhur@umich.edu Homepage: http://jdrf.neurology.med.umich.edu/sciminer/

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Enhanced Web Log Based Recommendation by Personalized Retrieval

Enhanced Web Log Based Recommendation by Personalized Retrieval Enhanced Web Log Based Recommendation by Personalized Retrieval Xueping Peng FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY UNIVERSITY OF TECHNOLOGY, SYDNEY A thesis submitted for the degree of Doctor

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Information mining and information retrieval : methods and applications

Information mining and information retrieval : methods and applications Information mining and information retrieval : methods and applications J. Mothe, C. Chrisment Institut de Recherche en Informatique de Toulouse Université Paul Sabatier, 118 Route de Narbonne, 31062 Toulouse

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

INFORMATION ACCESS VIA VOICE. dissertation is my own or was done in collaboration with my advisory committee. Yapin Zhong

INFORMATION ACCESS VIA VOICE. dissertation is my own or was done in collaboration with my advisory committee. Yapin Zhong INFORMATION ACCESS VIA VOICE Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. Yapin Zhong

More information

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E Powering Knowledge Discovery Insights from big data with Linguamatics I2E Gain actionable insights from unstructured data The world now generates an overwhelming amount of data, most of it written in natural

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD 10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Weighted Suffix Tree Document Model for Web Documents Clustering

Weighted Suffix Tree Document Model for Web Documents Clustering ISBN 978-952-5726-09-1 (Print) Proceedings of the Second International Symposium on Networking and Network Security (ISNNS 10) Jinggangshan, P. R. China, 2-4, April. 2010, pp. 165-169 Weighted Suffix Tree

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Text mining tools for semantically enriching the scientific literature

Text mining tools for semantically enriching the scientific literature Text mining tools for semantically enriching the scientific literature Sophia Ananiadou Director National Centre for Text Mining School of Computer Science University of Manchester Need for enriching the

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

Ontology Creation and Development Model

Ontology Creation and Development Model Ontology Creation and Development Model Pallavi Grover, Sonal Chawla Research Scholar, Department of Computer Science & Applications, Panjab University, Chandigarh, India Associate. Professor, Department

More information

EBP. Accessing the Biomedical Literature for the Best Evidence

EBP. Accessing the Biomedical Literature for the Best Evidence Accessing the Biomedical Literature for the Best Evidence Structuring the search for information and evidence Basic search resources Starting the search EBP Lab / Practice: Simple searches Using PubMed

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Mining Association Rules in Temporal Document Collections

Mining Association Rules in Temporal Document Collections Mining Association Rules in Temporal Document Collections Kjetil Nørvåg, Trond Øivind Eriksen, and Kjell-Inge Skogstad Dept. of Computer and Information Science, NTNU 7491 Trondheim, Norway Abstract. In

More information

Document Clustering based on Topic Maps

Document Clustering based on Topic Maps Document Clustering based on Topic Maps Muhammad Rafi Assistant Professor M. Shahid Shaikh Associate Professor Amir Farooq ABSTRACT Importance of document clustering is now widely acknowledged by researchers

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Clustering Documents in Large Text Corpora

Clustering Documents in Large Text Corpora Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

Prototyping a Biomedical Ontology Recommender Service

Prototyping a Biomedical Ontology Recommender Service Prototyping a Biomedical Ontology Recommender Service Clement Jonquet Nigam H. Shah Mark A. Musen jonquet@stanford.edu 1 Ontologies & data & annota@ons (1/2) Hard for biomedical researchers to find the

More information

Preliminary Experiments on Literature Based Discovery using the Semantic Vectors Package

Preliminary Experiments on Literature Based Discovery using the Semantic Vectors Package Preliminary Experiments on Literature Based Discovery using the Semantic Vectors Package M. Heidi McClure Intelligent Software Solutions, Inc Colorado Springs, CO 80919 Abstract This paper presents a literature

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Yikun Guo, Henk Harkema, Rob Gaizauskas University of Sheffield, UK {guo, harkema, gaizauskas}@dcs.shef.ac.uk

More information

Genescene: Biomedical Text and Data Mining

Genescene: Biomedical Text and Data Mining Claremont Colleges Scholarship @ Claremont CGU Faculty Publications and Research CGU Faculty Scholarship 5-1-2003 Genescene: Biomedical Text and Data Mining Gondy Leroy Claremont Graduate University Hsinchun

More information

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Prof. S.N. Sawalkar 1, Ms. Sheetal Yamde 2 1Head Department of Computer Science and Engineering, Computer

More information

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES K. P. M. L. P. Weerasinghe 149235H Faculty of Information Technology University of Moratuwa June 2017 AUTOMATED STUDENT S

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

MeSH : A Thesaurus for PubMed

MeSH : A Thesaurus for PubMed Scuola di dottorato di ricerca in Scienze Molecolari Resources and tools for bibliographic research MeSH : A Thesaurus for PubMed What is MeSH? Who uses MeSH? Why use MeSH? Searching by using the MeSH

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY Thesis by Peter Anthony Leong In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy California Institute

More information

The Effect of Word Sampling on Document Clustering

The Effect of Word Sampling on Document Clustering The Effect of Word Sampling on Document Clustering OMAR H. KARAM AHMED M. HAMAD SHERIN M. MOUSSA Department of Information Systems Faculty of Computer and Information Sciences University of Ain Shams,

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Multimedia Information Systems
