A Knowledge Management System for Organizing MEDLINE Database

Hyunki Kim, Su-Shing Chen
Computer and Information Science Engineering Department, University of Florida, Gainesville, Florida 32611, USA

With the explosion of biomedical data, information overload and users' inability to express their information needs may become more serious. To solve those problems, this paper presents a text data mining method that uses both text categorization and text clustering for building concept hierarchies for MEDLINE citations. The approach we propose is a three-step data mining process for organizing MEDLINE database: (1) categorizations according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors; (2) clustering using the results of MeSH term categorization; and (3) visualization of categories and hierarchical clusters. The hierarchies automatically generated may be used to support users in browsing behavior and help them identify good starting points for searching. An interface for this underlying system is also presented.

1. INTRODUCTION

MEDLINE, developed by the U.S. National Library of Medicine (NLM), is a database of indexed bibliographic citations and abstracts. It contains over 4,600 biomedical journals [1]. MEDLINE citations and abstracts are searchable via PubMed or the NLM Gateway. The NLM produces the MeSH (Medical Subject Headings) for the purposes of subject indexing, cataloging and searching journal articles in MEDLINE with an annual update cycle. MeSH consists of descriptors (or main headings), qualifiers (or subheadings), and supplementary concept records. It contains more than 19,000 descriptors which are used to describe the subject topic of an article. It also provides fewer than 100 qualifiers which are used to express a certain aspect of the concept represented by the descriptor. MeSH terms are arranged both alphabetically and in a hierarchical tree, in which specific subject categories are arranged beneath broader terms. MeSH terms provide a consistent way of retrieving information regardless of different terminology used by the authors in the original articles.
By using MeSH terms, the user is able to narrow the search space in MEDLINE. As a result, by adding more MeSH terms to the query, retrieval performance may be improved [2]. However, there are inherent challenges, as well. There may be information overload [3], and users may be unable to express their information needs, in order to take full advantage of the MEDLINE database.

MEDLINE contains over 12 million article citations. Beginning in 2002, it began to add over 2,000 new references on a daily basis [1]. Although the user may be able to limit the search space of MEDLINE with MeSH terms, keyword searches often result in a long list of results. For instance, when the user queries the term Parkinson's Disease by limiting it to the MeSH descriptors, PubMed returns over 21,000 results. Here, there is a problem of information overload, with the user having difficulty finding relevant information.

The inability of users to express information needs may become more serious, unless users have a precise knowledge in their area of interest, or an understanding of MeSH and its structure. The use of common abbreviations, technical terms, and synonyms in biomedical articles prevents users from articulating their information needs accurately. To avoid the vocabulary problem, MeSH may be used. However, it is difficult for an unfamiliar user to locate appropriate descriptors and/or qualifiers, since MeSH is a very complex thesaurus. Furthermore, new terms are added, some are modified, and others are removed each year as biomedical fields change. An imprecise query usually results in a long list of irrelevant hits [4].

Under such circumstances, a better mechanism is needed to organize information in order to help users explore within an organized information space [5]. In order to arrange the contents in a useful way, text categorization and text clustering have been researched extensively. Text categorization is a boiling down of the specific content of a document into a set of one or more pre-defined labels [6].
Text clustering can group similar documents into a set of clusters based on shared features among subsets of the documents [4], [7].

In this paper, we present a text data mining method that uses both text categorization and text clustering for building a concept hierarchy for MEDLINE citations. The approach we propose is a three-step data mining process for organizing MEDLINE database: (1) categorizations according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors, (2) clustering using the results of MeSH term categorization, and (3) visualization of categories and hierarchical clusters. The hierarchies automatically generated may be used to support users in browsing behavior as well as help them identify good starting points for searching. An interface for this underlying system is also presented.

2. METHODS

In this section, we will explain the proposed data mining method in detail. We used MySQL to store MEDLINE citations and additional data that was generated by the data mining process.

2.1 Data Collection

For the following experiment, we extracted a total of 1,736 citations encoded in XML (Extensible Markup Language) from the query Secondary Parkinson Disease, limiting the results to the MeSH major topic field and to citations with abstracts in MEDLINE.

2.2 Text Categorization

Categorization refers to an algorithm or procedure which results in the assignment of categories to documents [6]. We chose the MeSH major topic, the MeSH descriptor and qualifier, and a co-occurrence of MeSH descriptors as a feature to be used in classification. To categorize the collection according to the selected features, we first parsed the data collection encoded in XML using SAX (Simple API for XML). After extracting the MeSH major topics, the MeSH descriptors, and the co-occurrence of MeSH descriptors for each citation, we inserted the data into the corresponding MySQL tables.

2.3 Text Clustering using the Results of MeSH descriptor Categorization

Since many MeSH terms may be assigned to a citation and vice versa, categorization with the MeSH terms or the co-occurrence of MeSH terms often results in a large list or hierarchy. Some categories may contain a large number of documents. Simply listing categories associated with documents is inadequate for organizing data [6]. To alleviate this problem, the approach we propose here is to cluster the results of MeSH descriptor categorization using the hierarchical Self-Organizing Map (SOM). We chose only those MeSH descriptor categories whose document frequencies are over a predetermined threshold for clustering. Document frequency is the number of documents in which a term occurs.
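As an illustration of the categorization step of Section 2.2 and of the category selection just described, the following minimal sketch parses citations with Python's standard `xml.sax` module, collects descriptors and major topics per citation, and counts descriptor document frequencies and descriptor co-occurrences. It is not the implementation used in this work: the element and attribute names (`MedlineCitation`, `PMID`, `DescriptorName`, `MajorTopicYN`) are assumed from the MEDLINE XML format, the sample citations are invented toy data, qualifier-level major topic flags are ignored, and the results are kept in memory rather than inserted into MySQL tables.

```python
import itertools
import xml.sax
from collections import Counter, defaultdict


class MeshHandler(xml.sax.ContentHandler):
    """Collect MeSH descriptors and major topics per citation (assumed element names)."""

    def __init__(self):
        super().__init__()
        self.descriptors = defaultdict(set)   # PMID -> descriptor names
        self.major_topics = defaultdict(set)  # PMID -> descriptors flagged as major topics
        self._pmid = None
        self._tag = None
        self._major = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "MedlineCitation":
            self._pmid = None                 # a new citation begins
        if name in ("PMID", "DescriptorName"):
            self._tag = name
            self._text = []
            self._major = attrs.get("MajorTopicYN", "N") == "Y"

    def characters(self, content):
        if self._tag is not None:
            self._text.append(content)

    def endElement(self, name):
        if name != self._tag:
            return
        value = "".join(self._text).strip()
        if name == "PMID" and self._pmid is None:
            self._pmid = value                # first PMID identifies the citation
        elif name == "DescriptorName" and self._pmid is not None:
            self.descriptors[self._pmid].add(value)
            if self._major:
                self.major_topics[self._pmid].add(value)
        self._tag = None


def descriptor_frequencies(descriptors):
    """Document frequency of each descriptor: number of citations carrying it."""
    return Counter(name for names in descriptors.values() for name in names)


def cooccurrences(descriptors):
    """Count unordered descriptor pairs that appear together in a citation."""
    counts = Counter()
    for names in descriptors.values():
        for pair in itertools.combinations(sorted(names), 2):
            counts[pair] += 1
    return counts


# Invented two-citation sample; real data would come from the MEDLINE export.
SAMPLE = b"""<MedlineCitationSet>
<MedlineCitation><PMID>1</PMID><MeshHeadingList>
<MeshHeading><DescriptorName MajorTopicYN="Y">Parkinson Disease, Secondary</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N">Humans</DescriptorName></MeshHeading>
</MeshHeadingList></MedlineCitation>
<MedlineCitation><PMID>2</PMID><MeshHeadingList>
<MeshHeading><DescriptorName MajorTopicYN="Y">Parkinson Disease, Secondary</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N">Humans</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N">MPTP Poisoning</DescriptorName></MeshHeading>
</MeshHeadingList></MedlineCitation>
</MedlineCitationSet>"""

handler = MeshHandler()
xml.sax.parseString(SAMPLE, handler)
freq = descriptor_frequencies(handler.descriptors)
pairs = cooccurrences(handler.descriptors)
# Toy threshold of 1; the experiment keeps categories with more than 100 citations.
selected = sorted(name for name, n in freq.items() if n > 1)
```

The streaming handler keeps only the current element's text in memory, which is the usual reason to prefer SAX over a tree parser for large citation sets.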
Terms are extracted and selected using category dependent document frequency thresholding from the categories chosen. There are two ways that document frequency is calculated: category independent term selection and category dependent term selection [8]. In category independent term selection, the document frequency of each term is computed from all the documents in the collection, and the selected set of terms is used for each category. In category dependent term selection, the document frequency of each term is calculated from only those documents belonging to that category. Thus, different sets of terms are used for different categories. After the feature selection and extraction, and the SOM clustering, a concept hierarchy is obtained, by relying on the MeSH descriptors for the top layer, and by using feature vectors extracted from the titles and abstracts for the sub-layer.

2.3.1 Feature Extraction and Selection

To produce a concept hierarchy using the SOM, documents must be represented by a set of features. For this purpose, we use full-text indexing to extract a list of terms (words or phrases). The input vector is constructed by indexing the title and abstract elements of the collection. We then weight these terms using the vector space model in Information Retrieval [9]. In the vector space model, documents are represented as term vectors using the product of the term frequency (TF) and the inverse document frequency (IDF). Each entry in the document vector corresponds to the weight of a term in the document. We used the normalized TF x IDF term weighting scheme, the best fully weighted scheme [9], so that longer documents are not given more weight and all values of a document vector are distributed in the range of 0 to 1. Thus, a weighted word histogram can be viewed as the feature vector describing the document [10].

The preprocessing procedure is mainly divided into two stages: noun phrase extraction and term weighting.
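The category dependent thresholding and the normalized TF x IDF weighting described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in the experiment: the function names are hypothetical, the IDF is taken as log(N/n) over the documents of one category, the normalization is the cosine (unit length) normalization commonly associated with the best fully weighted scheme of [9], and the toy documents are invented.

```python
import math
from collections import Counter


def select_terms(category_docs, min_df):
    """Category dependent selection: document frequency is counted
    only over the documents of this one category."""
    df = Counter()
    for terms in category_docs.values():
        df.update(set(terms))              # each document counts a term once
    return {term: n for term, n in df.items() if n >= min_df}


def tfidf_vectors(category_docs, df):
    """Cosine-normalized TF x IDF vector per document over the selected terms."""
    vocab = sorted(df)
    n_docs = len(category_docs)
    vectors = {}
    for doc_id, terms in category_docs.items():
        tf = Counter(t for t in terms if t in df)
        raw = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        # Unit-length vectors keep every weight between 0 and 1.
        vectors[doc_id] = [w / norm if norm else 0.0 for w in raw]
    return vocab, vectors


# Toy category: noun phrases per citation after stop word removal.
docs = {
    "d1": ["dopamine", "levodopa", "dopamine"],
    "d2": ["dopamine", "tremor"],
    "d3": ["levodopa", "tremor", "mptp"],
}
df = select_terms(docs, min_df=2)          # the experiment uses a threshold of 10
vocab, vectors = tfidf_vectors(docs, df)
```

Because `select_terms` is called once per category, each category ends up with its own vocabulary, which is what distinguishes category dependent from category independent selection.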
In the noun phrase extraction phase, we first fetched the MEDLINE identifier, the title and abstract elements from the collection and then tokenized the title and abstract elements based on the Penn Treebank tokenization scheme to detect sentence boundaries, and to separate extraneous punctuation from the input text. The MEDLINE identifier was used as a document identifier. We then automatically assigned part of speech tags to words reflecting their syntactic category by using the rule-based part of speech tagger [11]. After recognizing the chunks that consist of noun phrases from the tagged text, we extracted a set of noun phrases for each citation. At this stage, we removed common terms by consulting a list of 906 stop words. We computed document frequency of all terms using category dependent term selection for those MeSH descriptor categories whose document frequencies were over a predetermined threshold (in this experiment, greater than 100 times). We then eliminated terms from the feature space whose document frequency was less than a predetermined threshold (in this experiment, less than 10 times). Finally, we weighted the terms indexed using the best fully weighted scheme [9], and assigned corresponding term weights to each document for each category selected. Thus, the weighted term vector set can be used as the input vector set for the SOM.

2.3.2 Construction of a Concept Hierarchy

Document clustering is defined as grouping similar documents into a cluster. To improve retrieval efficiency and effectiveness, related documents should be collected together in the same cluster based on some notion of similarity. The Self-Organizing Map is an unsupervised learning neural network algorithm for the visualization of high-dimensional data. The SOM defines a mapping from the input data space onto a two-dimensional array of nodes. Every node i is represented by a model vector, also called reference vector, $m_i = [m_{i1}, m_{i2}, \ldots, m_{in}]$, where n is the input vector dimension.

Our algorithm is different from other SOM-variant algorithms, in that each sub-layer SOM dynamically reconstructs a new input vector from an upper-level input vector. The following algorithm describes how to construct a subject-specific concept hierarchy.

1. Initialize network by using the subject feature vector as the input vector: Create a two-dimensional map and randomly initialize model vectors $m_i$ in the range of 0 to 1 to start from an arbitrary initial state.
2. Present input vector in sequential order: Cyclically present input vector $x(t)$, the weighted subject input vector of an n-dimensional space, to all nodes in the network. Each entry in the input vector corresponds to the weight of a term in the document. Zero means the term has no significance in the document or it simply does not exist in the document.
3. Find the winning node by computing the Euclidean distance for each node: In order to compare the input and weight vectors, each node computes the Euclidean distance between its weight vector and the input. The smallest Euclidean distance identifies the best-matching node, which is chosen as the winning node for that particular input vector. The best-matching node, denoted by the subscript c, is
   $\|x - m_c\| = \min_i \{\|x - m_i\|\}$.
4. Update weights of the winning node and its topological neighborhoods: The update rule for the model vector of node i is
   $m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$,
   where t is the discrete-time coordinate, $\alpha(t)$ is the adaptation coefficient, and $h_{ci}(t)$ is the neighborhood function, a smoothing kernel centered on the winning node.
5. Repeat steps 2-4 until all iterations have been completed.
6. Label nodes of the trained network with the noun phrases of the subject feature vectors: For each node, we determined the dimension with the greatest value, labeled the node with the corresponding term, and then aggregated nodes with the same term into groups. Thus, the subject-specific top-tier concept map is generated.
7. Repeat steps 1-6 by constructing a new input vector for each grouped concept region: For each grouped concept region containing more than k documents (e.g. 100), recursively create a SOM and repeat steps 1-6. At this point a new input feature vector is dynamically created by selecting only terms that are contained in the concept region from the upper-level feature vector.

For each MeSH descriptor category containing more than 100 documents, we generated a concept hierarchy using the SOM, limiting the maximum level of hierarchy to 3. We built a 10 x 10 SOM, and presented each input vector 100 times to the SOM. We then recursively built the sub-layer concept hierarchy by training a new 10 x 10 SOM with a new input vector, which is dynamically constructed by selecting only a document feature vector contained in the concept region from the upper-level feature vector. The concept hierarchy generated contains two kinds of information: category labels extracted from the MeSH descriptors for the top-level, and the concept hierarchy using the SOM for the sub-layer.
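The steps above can be sketched in a few dozen lines of plain Python. This is a minimal sketch, not the implementation used in the experiment: the linear decay of the adaptation coefficient, the Gaussian neighborhood kernel and its shrinking radius, the initial rate `alpha0`, and all function names are assumptions, since the text does not specify them. The sub-layer helper only selects the terms present in a region and does not recompute their weights.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def best_match(nodes, x):
    """Step 3: grid position of the node whose model vector is closest to x."""
    return min(nodes, key=lambda pos: distance(nodes[pos], x))


def train_som(inputs, rows=10, cols=10, epochs=100, alpha0=0.5, seed=0):
    """Steps 1-5, with an assumed linear decay and Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Step 1: random model vectors in the range 0 to 1.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    total = epochs * len(inputs)
    step = 0
    for _ in range(epochs):
        for x in inputs:                       # Step 2: cyclic presentation
            decay = 1.0 - step / total
            alpha = alpha0 * decay             # adaptation coefficient alpha(t)
            sigma = max(sigma0 * decay, 0.5)   # neighborhood radius (assumed schedule)
            win = best_match(nodes, x)
            for pos, m in nodes.items():       # Step 4: update winner and neighbors
                d2 = (pos[0] - win[0]) ** 2 + (pos[1] - win[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    m[k] += alpha * h * (x[k] - m[k])
            step += 1
    return nodes


def label_nodes(nodes, vocab):
    """Step 6: label each node with the term of its largest weight."""
    return {pos: vocab[max(range(len(m)), key=m.__getitem__)]
            for pos, m in nodes.items()}


def concept_regions(nodes, labels, vectors):
    """Group documents by the label of their best-matching node."""
    regions = {}
    for doc_id, x in vectors.items():
        regions.setdefault(labels[best_match(nodes, x)], []).append(doc_id)
    return regions


def sublayer_inputs(vocab, vectors, doc_ids):
    """Step 7: keep only the terms that occur in the documents of one region."""
    keep = [k for k in range(len(vocab)) if any(vectors[d][k] > 0 for d in doc_ids)]
    return [vocab[k] for k in keep], {d: [vectors[d][k] for k in keep] for d in doc_ids}


# Toy weighted term vectors; the experiment uses a 10 x 10 map per category.
vocab = ["dopamine", "levodopa", "tremor"]
vectors = {"d1": [0.9, 0.4, 0.0], "d2": [0.8, 0.0, 0.6],
           "d3": [0.0, 0.7, 0.7], "d4": [0.0, 0.0, 1.0]}
nodes = train_som(list(vectors.values()), rows=3, cols=3, epochs=100)
labels = label_nodes(nodes, vocab)
regions = concept_regions(nodes, labels, vectors)
```

To build the next level, `sublayer_inputs` would be called for every region holding more than k documents, and `train_som` run again on the reduced vectors, up to the maximum depth of 3 used here.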
We inserted this information into the MySQL database to build an interactive user interface.

2.4 Results

For the results of categorization, we extracted 2,210 distinct MeSH descriptors, 70 distinct MeSH qualifiers, 269 distinct MeSH major topics, and 60,192 co-occurring MeSH descriptors from the collection. On average, each citation in the collection contains 14 MeSH descriptors, 10 MeSH qualifiers, and 4 MeSH major topics.

Figure 1. Interface of MeSH Major Topic View

Figure 2. Interface of SOM Tree View

For text clustering, we identified a total of 20,367 distinct terms from the collection after the stop word removal. A total of 22 categories containing more than 100 citations were identified from the results of MeSH descriptor categorization. After the category dependent document frequency thresholding, an average of 66 terms were selected per category, ranging from 14 terms for one category to 260 terms for another category. After the hierarchical SOM clustering, 193 distinct concepts were generated from 22 categories.

3. USER INTERFACES

We provided users with four different views: three category hierarchies and one clustering hierarchy. We represented this hierarchy information as hierarchical trees to help users understand MeSH qualifiers and descriptors, so they could find a set of documents of interest, and locate good starting points for searching.

3.1 MeSH Major Topic Tree and MeSH Term Tree

The MeSH term tree displays the categorized information space, arranged first by descriptors and then by qualifiers. Figure 1 shows the interface of the MeSH term tree. In each level of hierarchy, MeSH terms are listed in alphabetical order, along with their document frequencies. When the user clicks on a category label that is either a descriptor or a qualifier on the left pane, the associated document set is displayed on the right pane. At this point, if the category is a descriptor, the associated qualifiers in the collection are also expanded as its children in the tree. Users can see more detailed information of a document by clicking on the title of a document that is shown on the right pane. To help users better understand the meaning of an ambiguous MeSH term, the corresponding descriptor data and context in the MeSH tree may be displayed by clicking on the link "MeSH Descriptor Data & Tree Structures" within each level of the tree.

In some cases, the user may want to see the category arranged by only MeSH major topics. The MeSH major topic tree provides the same information as the MeSH term tree except that it shows the category hierarchy arranged by only MeSH major topics.

3.2 MeSH Co-occurrence Tree

The MeSH co-occurrence tree provides the co-occurrence of MeSH descriptors, along with their co-occurrence frequency in the collection. Since an average of 14 MeSH descriptors are assigned to each citation in the collection, there are a large number of nodes in the co-occurrence tree. To better organize the co-occurrence tree, the interface allows the user to select the co-occurrence frequency range. Thus, the user can easily identify co-occurring semantic types in the collection.

3.3 SOM Tree

The SOM tree was constructed for each MeSH descriptor whose document frequency was greater than a predetermined threshold (in this experiment, 100 citations). Typically, 10 to 12 MeSH descriptors are assigned to each MEDLINE citation. Thus, some categories associated with a large number of citations do not characterize the information in a way that is of interest to the user [6]. To solve this problem, we further arrange those categories hierarchically using the SOM. In some cases, clustering seems useful in helping users filter out sets of documents that are clearly not relevant and should be ignored [6]. Figure 2 shows the interface for browsing the SOM tree.

4. CONCLUSIONS

We have proposed a three-step data mining process for organizing MEDLINE database: (1) categorizations according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors; (2) clustering using the results of MeSH term categorization; and (3) visualization of categories and hierarchical clusters. The proposed SOM algorithm is different from other SOM-variant algorithms. First, it uses the results of categorization.
Second, after constructing the top-level concept map and aggregating nodes with the same concept on the map into a group, it dynamically reconstructs the input vector by selecting, for each concept region, only the terms contained in that region from the input vector of the higher level, and recomputing their weights to generate the sub-layer map. Thus, the new input vector for each SOM reflects only the contents of the region and not the whole collection.

One of the weak points of our approach is that the SOM algorithm is inadequate for collection updates when new documents are added. Another weakness is that the hierarchical clustering results still need a user evaluation. Future research will evaluate the accuracy of the clustering results and refine the algorithm.

REFERENCES

1. National Library of Medicine: MEDLINE Fact Sheet. http://www.nlm.nih.gov/pubs/factsheets/medline.html.
2. French, J. C., Powell, A. L., Gey, F., and Perelman, N.: Exploiting a Controlled Vocabulary to Improve Collection Selection and Retrieval Effectiveness. In Proceedings Tenth International Conference on Information and Knowledge Management (CIKM). November 2001. 199-206.
3. Pratt, W., and Fagan, L.: The Usefulness of Dynamically Categorizing Search Results. In Journal of the American Medical Informatics Association (JAMIA). 2000. 7(6), 605-617.
4. Chen, H., Schuffels, C., and Orwig, R.: Internet Categorization and Search: A Self-Organizing Approach. In Journal of Visual Communication and Image Representation. 1996. 7(1), 88-102.
5. Chen, S.: Digital Libraries: The Life Cycle of Information. Better Earth Publisher, 1998.
6. Hearst, M. A.: The Use of Categories and Clusters for Organizing Retrieval Results. Natural Language Information Retrieval, Dordrecht, Kluwer Academic Publishers. 1999. 333-374.
7. Kohonen, T.: Self-Organization of Very Large Document Collection: State of the Art. In Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks, Skovde, Sweden. 1998.
8. Chen, H., and Ho, T. K.: Evaluation of Decision Forests on Text Categorization. In Proceedings of the 7th Conference on Document Recognition and Retrieval. 2000. 191-199.
9. Salton, G., and Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management. 1988. 24(5), 513-523.
10. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A.: Self Organizing of a Massive Document Collection. IEEE Transactions on Neural Networks. May 2000. 11(3).
11. Brill, E.: A Simple Rule-based Part of Speech Tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy. 1992.