Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval

Luo Si 1, Jie Lu 2 and Jamie Callan 2
1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA, lsi@cs.purdue.edu
2 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA, {jielu, callan}@cs.cmu.edu

ABSTRACT
We participated in the passage retrieval and aspect retrieval subtasks of the TREC 2006 Genomics Track. This paper describes the methods developed for these two subtasks. For passage retrieval, our query expansion method utilizes multiple external biomedical resources to extract acronyms, aliases, and synonyms, and we propose a post-processing step which combines the evidence from multiple scoring methods to improve relevance-based passage rankings. For aspect retrieval, our method estimates the topical aspects of the retrieved passages and generates passage rankings by considering both topical relevance and topical novelty. Empirical results demonstrate the effectiveness of these methods.

1. INTRODUCTION
We describe in this paper the design of the system built for the passage retrieval and aspect retrieval subtasks of the TREC 2006 Genomics Track. The modules provided in the Lemur toolkit for language modeling and information retrieval (version 4.2) 1 constitute the backbone of our system. The Indri index was chosen for its support of conveniently indexing and retrieving various fields of the documents, and its rich query language that easily handles phrases and structured queries. New methods and tools were developed to equip the system with enhanced capabilities in collection pre-processing, indexing, query expansion, passage retrieval, result post-processing, and aspect retrieval. In particular, based on the success of query expansion indicated by the results of previous Genomics tracks, we continue the exploration of incorporating domain knowledge to improve the quality of query topics. Acronyms, aliases, and synonyms are extracted from external biomedical resources, weighted, and combined using the Indri query operators to expand original queries. A hierarchical Dirichlet smoothing method is used for utilizing passage, document, and collection language models in passage retrieval. A post-processing step that combines the scores from passage retrieval, document retrieval, and a query term matching-based method further improves the search performance. An external database constructed from MEDLINE abstracts is used to assign MeSH terms to passages for estimating topical aspects. Furthermore, passage rankings are generated for aspect retrieval by considering both topical relevance and topical redundancy from the estimated aspects of the passages.

The following section describes the various modules of the system developed for genomic passage retrieval and aspect retrieval. Section 3 presents some evaluation results, and Section 4 concludes.

2. SYSTEM DESCRIPTION
In this section we elaborate on the methods and tools used in different modules of our system, focusing on query expansion, post-processing, and aspect retrieval.

2.1 PRE-PROCESSING
The corpus for the TREC 2006 Genomics Track includes 160,472 biomedical documents from 59 journals. The documents are in HTML format. The formats of the HTML files are similar but not identical. For example, some documents use the tag BIB to indicate references while other documents use the tag Reference. Since the TREC 2006 Genomics Track mainly focuses on passage retrieval, it is important to design an effective method to segment the biomedical documents into passages. A passage extraction method is developed to consider both paragraph boundaries and sentence boundaries.
As required by the TREC 2006 Genomics Track, a biomedical document is first segmented into many paragraphs based on the tags <p> or </p>. Special treatment is applied to the reference part to make sure that the paragraphs for references are separate from the paragraphs in the main text part. Furthermore, the HTML tags and other irrelevant contents (e.g., scripts) are removed from the extracted paragraphs. Then each paragraph is segmented into many sentences with a modified version of a Perl script 2. The sentences are assigned to individual passages until the length of a passage exceeds 50 words. There is no overlap among the passages, which means that two consecutive passages contain different sentences. This procedure is applied to all the biomedical documents in the Genomics Track. Altogether, there are 2.1 million passages extracted from the biomedical documents, so each document on average contains about 13 passages. A new version of each document is built by merging the passages extracted from the document. The identity of each passage is preserved by using special tags. A new text collection is built from all the new documents.

1 http://www.lemurproject.org/
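For illustration, a minimal Python sketch of this passage extraction strategy is given below; it is not the pipeline's actual code, and the crude regex-based sentence splitter stands in for the modified Perl script mentioned above.

```python
import re

def extract_passages(html, min_words=50):
    """Split an HTML document into non-overlapping passages.

    Sentences are appended to the current passage until its length exceeds
    `min_words` words (50 in the setup described above); consecutive passages
    never share a sentence.  A crude regex splitter replaces the modified
    Perl script used in the real pipeline.
    """
    # Paragraph boundaries are the <p> / </p> tags.
    paragraphs = re.split(r'</?p\b[^>]*>', html, flags=re.IGNORECASE)
    passages = []
    for para in paragraphs:
        text = re.sub(r'<[^>]+>', ' ', para)        # strip remaining HTML tags
        text = re.sub(r'\s+', ' ', text).strip()
        if not text:
            continue
        sentences = re.split(r'(?<=[.!?])\s+', text)
        current, length = [], 0
        for sent in sentences:
            current.append(sent)
            length += len(sent.split())
            if length > min_words:                  # passage exceeds 50 words: close it
                passages.append(' '.join(current))
                current, length = [], 0
        if current:                                 # flush the final partial passage
            passages.append(' '.join(current))
    return passages
```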

2.2 INDEXING
All the new documents generated by the pre-processing module are indexed by the indexing module. The document parser in the indexing module provided by the Lemur toolkit is modified to further process potential biomedical acronyms. Specifically, two additional operations are applied to the tokens recognized by the text tokenizer: segmentation and normalization. Segmentation takes as input a token free of space or punctuation characters (including hyphens) and looks for boundaries between any two adjacent characters of the token, based on which it segments the token into multiple tokens. A boundary occurs between two adjacent characters of the token in the following cases: i) one is numeric and the other is alphabetic; or ii) both are alphabetic but of different cases, except for the case that the characters are the first two characters of the token with the first in uppercase and the second in lowercase. For example, taking the token hMMS2 as input, the output of the segmentation operation includes three tokens: h, MMS, and 2. Normalization converts Roman digits into their corresponding Arabic numbers. For instance, the II in hMMS II is converted to 2 by the normalization operation.
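A minimal sketch of these two token-level operations follows; it assumes the input token is already free of spaces and punctuation, and only Roman numerals up to X are mapped.

```python
ROMAN = {'i': 1, 'ii': 2, 'iii': 3, 'iv': 4, 'v': 5,
         'vi': 6, 'vii': 7, 'viii': 8, 'ix': 9, 'x': 10}

def segment(token):
    """Split a token at letter/digit boundaries and at case changes, except
    when an initial uppercase letter is followed by a lowercase letter."""
    parts, start = [], 0
    for i in range(1, len(token)):
        prev, cur = token[i - 1], token[i]
        boundary = prev.isdigit() != cur.isdigit()          # letter/digit switch
        if prev.isalpha() and cur.isalpha() and prev.islower() != cur.islower():
            # case change, unless it is the allowed initial Uppercase-lowercase pair
            if not (i == 1 and prev.isupper() and cur.islower()):
                boundary = True
        if boundary:
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts

def normalize(token):
    """Map Roman numerals to Arabic digits (e.g. 'II' -> '2')."""
    return str(ROMAN[token.lower()]) if token.lower() in ROMAN else token

# segment('hMMS2') -> ['h', 'MMS', '2'];  normalize('II') -> '2'
```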
2.3 QUERY EXPANSION
The query expansion module parses each of the original queries and utilizes several external resources to incorporate domain knowledge by expanding queries with acronyms, aliases, and synonyms. Both the original terms and the expanded terms are weighted. Each original term and its expanded terms are combined using the weighted synonym operator #wsyn into a #wsyn expression, and different #wsyn expressions for a query are combined using the #weight belief operator of the Indri query language 3.

The data provided by AcroMed 4, LocusLink 5, and UMLS 6 are processed to create three lexicons. In the AcroMed lexicon, entries are indexed by technical terms or phrases, and each entry is a list of acronyms associated with the corresponding technical term/phrase, accompanied by the frequencies of such associations. In the LocusLink lexicon, entries are indexed by acronyms, and each entry is a list of aliases that are only associated with the corresponding acronym but no other acronyms. In the UMLS lexicon, entries are indexed by technical terms or phrases, and each entry is a list of synonyms associated with the corresponding technical term/phrase. For example, the phrase "huntington's disease" has an entry in the AcroMed lexicon "1582 hd 2 h.d.", indicating that the acronym "hd" is associated with "huntington's disease" 1582 times while "h.d." is associated with "huntington's disease" only twice. An example for the LocusLink lexicon is that the acronym psen1 corresponds to a list of aliases ps-1, pre1, psen, zfps1, zf-ps1. The entry provided by UMLS for the phrase "mad cow disease" is "bovine spongiform encephalopathy, bse, bovine spongiform encephalitis", excluding the variants generated by varying the form or order of the words.

For each query, the lexicons are applied in the order of AcroMed, LocusLink, and UMLS for query expansion. Generally speaking, AcroMed is used to find the acronyms associated with a technical term or phrase (gene or disease name) that occurs in the query.

2 http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=ss
3 http://www.lemurproject.org/lemur/indriquerylanguage.html
4 http://medstract.med.tufts.edu/acro1.1/
5 http://www.ncbi.nlm.nih.gov/projects/locuslink/
6 http://www.nlm.nih.gov/research/umls

LocusLink is used to find the aliases of the acronyms identified by AcroMed. UMLS is used to find the synonyms of the technical terms or phrases not recognized by AcroMed or LocusLink. In addition, commonly observed synonyms for some function words such as "role" and "disease" that occur in the query are added as expansion terms, and multiple surface forms of the acronyms or aliases are included. Specifically, the following steps are performed in order to generate an expanded query:

1. If a string of word(s) in the original query has an entry in AcroMed, the string is considered a technical term/phrase. The acronyms in its entry in AcroMed whose frequencies are above a threshold (25) are added as expansion terms, with the weight of each acronym proportional to its frequency, normalized by the maximum frequency in the entry so that the maximum weight is 1.00.

2. If an acronym included in the expanded query can locate its aliases in LocusLink, the aliases are included and their weights are equal to the weight of the acronym.

3. For the strings of words that occur in the original query but are not expanded by steps 1-2, UMLS is used to find possible synonyms to add to the expanded query, each with a weight of 0.50.

4. The same operations of segmentation and normalization used in the document parser are applied to the acronyms in the original query, and to the acronyms and aliases in the expanded query. In addition, the segmented tokens of an acronym or alias output by the segmentation operation are fed into the assembly operation, which assembles the tokens to produce multiple variants with the same weight as the acronym or alias. For example, the acronym hMMS2 is segmented into h, MMS, and 2, which are assembled into hMMS2, h MMS2, hMMS 2, and h MMS 2 in phrasal representations.

5. Each word or phrase that occurs in the original query and is expanded by the above steps uses the weighted synonym operator #wsyn to combine itself (with the weight 1.00) and its expanded acronyms, aliases or synonyms (with the corresponding weights described in the above steps). The overall weight of the #wsyn expression is 2.00.

6. A few function words commonly observed for genomic retrieval are grouped and the words within each group are regarded as synonyms. If any word in the original query occurs in a synonym group, the other words in the same group are added as expansion terms with weights half the value of the weight given to the original word. The weighted synonym operator #wsyn is used to combine these terms. In particular, two groups of synonyms are used: {role, affect, impact, contribute} and {disease, cancer, tumor}. The first group of synonyms has an overall weight of 1.00 for the #wsyn expression, while the second group has an overall weight of 2.00. Therefore, if the words "role" and "cancer" are in the original query, they will be expanded into 1.00 #wsyn(1.00 role 0.50 affect 0.50 impact 0.50 contribute) and 2.00 #wsyn(1.00 cancer 0.50 tumor 0.50 disease).

7. Finally, the #wsyn expressions created during steps 1-6 and the words left in the original query are combined using the #weight belief operator, with the weight of each unexpanded word in the original query set to 2.00.
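For illustration, a small helper of the kind one might use to assemble the expanded Indri query from these steps is sketched below. The #wsyn and #weight operators are Indri's own; the helper functions and the unexpanded query word brca1 in the example are hypothetical.

```python
def wsyn(term_weights):
    """Render a weighted synonym expression, e.g. #wsyn(1.00 role 0.50 affect)."""
    inner = ' '.join(f'{w:.2f} {t}' for t, w in term_weights)
    return f'#wsyn({inner})'

def expanded_query(groups, plain_terms):
    """Combine #wsyn groups and unexpanded words with the #weight operator.

    `groups` is a list of (overall_weight, [(term, weight), ...]) pairs;
    unexpanded original words receive a weight of 2.00 (step 7 above).
    """
    parts = [f'{w:.2f} {wsyn(tw)}' for w, tw in groups]
    parts += [f'2.00 {t}' for t in plain_terms]
    return '#weight( ' + ' '.join(parts) + ' )'

# The step-6 example plus one hypothetical unexpanded query word:
print(expanded_query(
    [(1.00, [('role', 1.00), ('affect', 0.50), ('impact', 0.50), ('contribute', 0.50)]),
     (2.00, [('cancer', 1.00), ('tumor', 0.50), ('disease', 0.50)])],
    ['brca1']))
# -> #weight( 1.00 #wsyn(1.00 role 0.50 affect 0.50 impact 0.50 contribute)
#             2.00 #wsyn(1.00 cancer 0.50 tumor 0.50 disease) 2.00 brca1 )   (one line)
```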
2.4 PASSAGE RETRIEVAL
A variant of the language modeling method is used for passage retrieval. This method extends the traditional Dirichlet smoothing method [Zhai and Lafferty, 2001] with hierarchical smoothing. Specifically, the log-likelihood of generating query Q from the j-th passage of the i-th document is calculated as follows:

\log P(Q \mid \{psg_{i,j}, d_i\}) = \sum_{w \in Q} qtf(w) \cdot \log \frac{psg\_tf_{i,j}(w) + d\_probtf_i(w) \cdot u_1}{|psg_{i,j}| + u_1}

where psg_tf_{i,j}(w) and qtf(w) indicate the term frequency of word w in the j-th passage and in the user query respectively, |psg_{i,j}| is the length of the j-th passage, u_1 is a normalization constant (set to 200 empirically), and d_probtf_i(w) introduces the evidence of word w from the document level, which is estimated as:

d\_probtf_i(w) = \frac{dtf_i(w) + p_c(w) \cdot u_2}{|d_i| + u_2}

where dtf_i(w) represents the term frequency of word w in the i-th document, |d_i| is the length of the i-th document, p_c(w) is the word probability in the whole text collection, calculated as the relative frequency of the word in the collection, and u_2 is a normalization constant (set to 1000 empirically).
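For concreteness, a minimal Python sketch of this scoring formula is given below; it assumes simple in-memory term statistics (hypothetical arguments coll_tf and coll_len for the collection model) rather than the Lemur/Indri index actually used.

```python
import math
from collections import Counter

def passage_loglik(query_terms, passage_terms, doc_terms, coll_tf, coll_len,
                   u1=200.0, u2=1000.0):
    """Hierarchical Dirichlet smoothing: the passage model is smoothed with the
    document model, which is itself smoothed with the collection model
    (u1 = 200 and u2 = 1000, as in the formulas above)."""
    qtf = Counter(query_terms)
    psg_tf, dtf = Counter(passage_terms), Counter(doc_terms)
    psg_len, doc_len = len(passage_terms), len(doc_terms)
    score = 0.0
    for w, q_count in qtf.items():
        p_c = coll_tf.get(w, 0) / coll_len                   # collection language model
        d_probtf = (dtf[w] + p_c * u2) / (doc_len + u2)      # document-level evidence
        p_w = (psg_tf[w] + d_probtf * u1) / (psg_len + u1)   # smoothed passage model
        if p_w > 0.0:                                        # skip terms unseen everywhere
            score += q_count * math.log(p_w)
    return score
```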

2.5 POST-PROCESSING
The retrieved passages are first cleaned to remove passages that do not contain meaningful content and to get rid of unnecessary characters at the beginning or the end of each passage. Specifically, the following three steps are performed:

1. Simple word patterns are used to detect passages consisting of acknowledgments, abbreviation lists, keyword lists, addresses, figure lists, and table lists, and these passages are discarded.

2. A throw-away list is constructed which includes words commonly used in section titles of papers, such as introduction, experiments, discussions, etc., and their morphological variants. If these words occur at the beginning or the end of a retrieved passage, they are removed.

3. Regular expression patterns are used to identify tags, references, figures, tables, and punctuation at the beginning or the end of a retrieved passage in order to remove them.

The cleaned passages are rescored by combining the scores obtained for document retrieval, passage retrieval, and the query term matching-based scores recalculated for the passages. The details of calculating the query term matching-based score for a cleaned passage are given below. The terms of an expanded query are classified into three types, namely type-0, type-1, and type-2. Type-0 terms are terms that occur in a function word list containing common terms for genomic queries such as role, contribute, affect, develop, interact, activity, mutate, etc., and their morphological variants. Type-1 terms are non-type-0 terms added to the query during query expansion. Type-2 terms are non-type-0 terms in the original query. The query term matching-based passage score is calculated by accumulating the adjusted frequencies of the matched query terms in the passage, which are computed differently for the different types of terms:

Type-0: 0.50;  Type-1: weight;  Type-2: weight + (tf - 1) * 0.25

where weight is each term's weight in the expanded query, tf is the frequency of the term in the passage, and 0.50 is the typical minimum weight for a term in an expanded query. The query term matching-based scores of the retrieved passages for a query are normalized by:

S_{normalized} = \frac{S - 2.00}{\min\{4.00, S_{max}\} - S_{min}}

where S is the score of a passage before normalization, and S_{max} and S_{min} are the maximum and minimum scores of the retrieved passages for the query. Because the weight of a type-2 term in the query is 2.00, the minimum score of a retrieved passage that matches at least one type-2 term is 2.00. If a retrieved passage fails to match any of the type-2 terms in the query, it is very likely to have a negative normalized score unless it matches multiple type-1 terms. 4.00 is the minimum score of a retrieved passage that matches at least two distinct type-2 terms. Using min{4.00, S_{max}} instead of S_{max} for normalization downgrades the differences between passages that match two or more distinct type-2 terms, since these passages probably have the same degree of relevance.
The original scores of the retrieved passages and the original scores of their corresponding documents are normalized using the standard max-min normalization:

S_{normalized} = \frac{S - S_{min}}{S_{max} - S_{min}}

The final score of a retrieved passage is a weighted linear combination of the normalized passage retrieval score (with a weight of 0.85), the normalized document retrieval score of the document which contains the passage (with a weight of 0.05), and the normalized query term matching-based passage score (with a weight of 0.10). Passages for a query are reranked using the final scores.
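A minimal sketch of this rescoring step is given below. It relies on the Type-2 adjustment as reconstructed above (weight + (tf - 1) * 0.25), adds guards against degenerate score ranges, and assumes a hypothetical query_terms mapping from each expanded-query term to its (type, weight) pair.

```python
from collections import Counter

def match_score(passage_tokens, query_terms):
    """Query term matching-based score: type-0 terms contribute 0.50, type-1
    terms their expansion weight, type-2 terms weight + (tf - 1) * 0.25.
    `query_terms` maps term -> (type, weight) for the expanded query."""
    tf = Counter(passage_tokens)
    score = 0.0
    for term, (ttype, weight) in query_terms.items():
        if tf[term] == 0:
            continue
        if ttype == 0:
            score += 0.50
        elif ttype == 1:
            score += weight
        else:                                   # type-2: original non-function words
            score += weight + (tf[term] - 1) * 0.25
    return score

def normalize_match(scores):
    """(S - 2.00) / (min(4.00, S_max) - S_min), as described above."""
    s_max, s_min = max(scores), min(scores)
    denom = (min(4.00, s_max) - s_min) or 1.0
    return [(s - 2.00) / denom for s in scores]

def max_min(scores):
    """Standard max-min normalization."""
    s_max, s_min = max(scores), min(scores)
    denom = (s_max - s_min) or 1.0
    return [(s - s_min) / denom for s in scores]

def final_scores(psg_scores, doc_scores, match_scores):
    """Weighted combination: 0.85 passage + 0.05 document + 0.10 term matching."""
    return [0.85 * p + 0.05 * d + 0.10 * m
            for p, d, m in zip(max_min(psg_scores), max_min(doc_scores),
                               normalize_match(match_scores))]
```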

2.6 ASPECT RETRIEVAL
Although topical relevance is the most important factor in information retrieval, an effective information retrieval system also needs to consider the topical aspects of the retrieved information. For example, a biomedical researcher would like to avoid seeing similar or even duplicated contents. Therefore, the redundant information should be removed. The retrieval performance is generally considered better if the top-ranked documents or passages are not only relevant but also cover a wide range of aspects. Our aspect retrieval module considers both topical relevance and the coverage of aspects. Topical relevance can be directly measured by the scores from the passage retrieval module (with or without rescoring in the post-processing module). Topical aspects of each retrieved passage can be based on the MeSH (i.e., Medical Subject Headings) terms which best describe the semantic meaning of the passage. However, since MeSH terms are only associated with a whole document in the MEDLINE database, the MeSH terms of each retrieved passage need to be estimated.

The process of assigning appropriate MeSH terms to retrieved passages is viewed as a multiple-category classification problem in this work. Specifically, each retrieved passage is regarded as a query to locate similar documents from a subset of the MEDLINE database. As the similar documents found tend to share similar MeSH terms with this passage, we can assign MeSH terms to the passage based on the MeSH terms associated with the similar documents. In our work, the subset of the MEDLINE database is formed by the data for the ad hoc retrieval task of the TREC 2003 Genomics Track, which includes 4,491,008 MEDLINE abstracts published during 1993-2003. Most of the abstracts are associated with MeSH terms assigned by domain experts. After some text preprocessing such as removing stopwords and stemming, the title field and the body text field of each abstract are indexed. The average document length is about 160.

For each user query, the content of each of the 500 top-ranked passages from the ranked list of passage retrieval is obtained and processed by removing stopwords and stemming to form a document query, which is used for locating similar documents from the subset of the MEDLINE database. The Okapi formula is used to retrieve the 50 top-ranked MEDLINE abstracts as the most similar documents. Then, the 15 most common MeSH terms are extracted from these MEDLINE abstracts. Each MeSH term is associated with a weight based on the number of occurrences of this term in the top-ranked MEDLINE abstracts. These 15 MeSH terms are further represented by a vector in a vector space formed by all the MeSH terms. We utilize the TF.IDF method as the term weighting scheme of the vector space:

val(w) = tf(w) \cdot \log \frac{c}{ctf(w)}

where tf(w) is the number of occurrences of a specific MeSH term within the top-ranked MEDLINE abstracts, ctf(w) is the number of occurrences of this MeSH term within the subset of the MEDLINE database, and c is the total number of occurrences of all MeSH terms in the database.

In addition to extracting representative MeSH terms for each passage by analyzing the content similarity between the passage and the MEDLINE abstracts, the MeSH terms associated with the biomedical document that contains this passage can also be used to generate the MeSH terms for this passage. In order to utilize both pieces of evidence, we use a linear form to combine the vector representations of the MeSH terms from these two sources. More specifically, the two vectors are normalized respectively, and the normalized vectors are summed together with an equal weight (i.e., 0.5). The final representation is obtained by normalizing the sum.
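A minimal sketch of this aspect estimation is given below. It assumes the 50 most similar MEDLINE abstracts have already been retrieved with the Okapi formula, and it weights the containing document's own MeSH terms uniformly before normalization; that uniform weighting is an assumption rather than a detail taken from the description above.

```python
import math
from collections import Counter

def unit(vec):
    """L2-normalize a sparse term -> weight vector."""
    norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
    return {w: x / norm for w, x in vec.items()}

def mesh_vector(retrieved_mesh_lists, doc_mesh_terms, ctf, c, top_k=15):
    """Estimate a passage's MeSH aspect vector.

    `retrieved_mesh_lists`: MeSH term lists of the 50 most similar MEDLINE
    abstracts (found using the passage as a query); `doc_mesh_terms`: MeSH
    terms of the full-text document containing the passage; `ctf`/`c`:
    per-term and total MeSH occurrence counts over the MEDLINE subset.
    """
    tf = Counter(t for terms in retrieved_mesh_lists for t in terms)
    # Keep the 15 most common MeSH terms, weighted by tf(w) * log(c / ctf(w)).
    abstract_vec = {w: n * math.log(c / max(ctf.get(w, 1), 1))
                    for w, n in tf.most_common(top_k)}
    doc_vec = {w: 1.0 for w in doc_mesh_terms}       # uniform weights (assumption)
    combined = {}
    for vec in (unit(abstract_vec), unit(doc_vec)):  # equal 0.5 / 0.5 weighting
        for w, x in vec.items():
            combined[w] = combined.get(w, 0.0) + 0.5 * x
    return unit(combined)                            # normalize the sum
```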
The procedures described in the above paragraphs can be used to derive the MeSH representations for all the top-ranked passages (the top 500 in this work) for a user query. These MeSH representations reflect the topical aspects of the passages. With both the topical aspects and the topical relevance information (i.e., the passage retrieval scores), a new ranked list can be constructed by reranking the passage retrieval result. In particular, a procedure similar to the maximal marginal relevance method [Carbonell and Goldstein, 1998] is adopted here. This is a greedy search approach: at each step, a passage is selected and added to the bottom of the current reranked list. A combination score is calculated for each passage by considering both the topical relevance information and the novelty information of the topical aspects with respect to the current reranked list (i.e., the selected passages):

S_{comb}(psg_i) = \lambda \cdot S_{rel}(psg_i) - (1 - \lambda) \cdot \max_{psg_j \in Sel} Sim(psg_i, psg_j)

where S_{comb} represents the combination score, S_{rel} represents the normalized passage retrieval score (i.e., divided by the maximum score), \lambda is the factor to adjust the relative weights of the topical relevance information and the topical aspect information (set to 0.5 in this work), Sel is the current reranked list of selected passages, and Sim is a function that calculates the cosine similarity between the MeSH term representations of two passages to reflect their topical similarity. At each step, the combination scores are calculated for the passages that are not yet in the current reranked list for a query. Since the passage with the maximum combination score reflects a good trade-off between topical relevance and topical novelty, it is added to the bottom of the current reranked list. Note that the above procedure is only applied to rerank the top 1-500 passages. The top 501-1000 passages remain the same as in the ranked list of passage retrieval based solely on topical relevance.
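The selection loop itself can be sketched as a simple greedy procedure; the sketch below assumes positive relevance scores (for example, the post-processed final scores) and the sparse, L2-normalized aspect vectors produced above.

```python
def cosine(u, v):
    """Cosine similarity of two sparse, already L2-normalized vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def rerank(passage_ids, rel_scores, aspect_vecs, lam=0.5):
    """Greedy MMR-style reranking of the top-ranked passages.

    At each step the passage maximizing
        lam * S_rel - (1 - lam) * max similarity to the already selected passages
    is appended to the reranked list; S_rel is the retrieval score divided by
    the maximum retrieval score (assumed positive here).
    """
    max_rel = max(rel_scores.values()) or 1.0
    s_rel = {p: s / max_rel for p, s in rel_scores.items()}
    selected, remaining = [], set(passage_ids)
    while remaining:
        def combined(p):
            redundancy = max((cosine(aspect_vecs[p], aspect_vecs[q])
                              for q in selected), default=0.0)
            return lam * s_rel[p] - (1 - lam) * redundancy
        best = max(remaining, key=combined)
        selected.append(best)
        remaining.remove(best)
    return selected
```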

3. EVALUATION
We submitted three runs using automatically constructed queries. PCPsgClean used the automatically expanded queries to conduct passage retrieval, and the retrieved passages were cleaned during post-processing but not rescored. PCPsgRescore reranked the results obtained from PCPsgClean using the rescoring method described in Section 2.5. PCPsgAspect further processed the results from passage retrieval to optimize performance for aspect retrieval based on the method described in Section 2.6. None of our results were optimized for document retrieval performance.

Figure 1 shows the performance of our system compared with the best and median performance for passage retrieval and aspect retrieval. For passage retrieval, our PCPsgRescore run has 100% of the topics achieving performance better than the median, and 5 topics achieve the best performance. For aspect retrieval, our PCPsgAspect run has 92% of the topics achieving performance better than the median, and 1 topic achieves the best performance.

Figure 1. The performance (MAP per topic) of our system compared with the best and median performance: (a) passage retrieval (PCPsgClean, PCPsgRescore); (b) aspect retrieval (PCPsgAspect).

4. CONCLUSION
Our results for passage retrieval show that query expansion based on external biomedical resources is an effective technique, and the hierarchical Dirichlet smoothing method that utilizes passage, document, and collection language models works reasonably well for passage retrieval. Reranking the retrieved passages by combining scores from passage retrieval, document retrieval, and the query term matching-based rescoring consistently further improves the performance of passage retrieval. Our method of estimating topical aspects of the retrieved passages and generating passage rankings by considering both topical relevance and topical novelty has an acceptable performance but still leaves room for improvement.

REFERENCES
[1] Carbonell, J. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[2] Zhai, C. X. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval.
In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.