Study of an automatic indexing tool: NLM Medical Text Indexer Rita Pinal Fuentes and Eva Lorenzo Iglesias

Abstract. In this paper we study the architecture of the MTI (Medical Text Indexer) and evaluate the parameterization options of this tool for obtaining the best MESH terms. MESH is a controlled vocabulary used for indexing databases such as MEDLINE. Here we show that semantic relations are more important than the syntactic structure of the words in a document.

Keywords: Scientific document indexing, MTI, MESH, MEDLINE, PubMed.

1 Introduction

The past decade has seen a growth in the amount of experimental and computational biomedical data, accompanied by an accelerated increase in the number of biomedical publications [8]. This growth has led institutions such as the NLM (National Library of Medicine) [15] to help provide health professionals with access to the information necessary for research, health care and education. MEDLINE is a bibliographic database of the NLM. Its scope includes topics as diverse as microbiology, delivery of health care, nutrition, pharmacology and environmental health. MEDLINE is the most comprehensive source of biomedical bibliographic information in existence. It contains over 18 million journal citations and abstracts for biomedical literature from around the world, from 1948 to the present, and the number continues to grow steeply, with a large number of citations added in 2008 alone [16].

In order to index, search and catalog these citations, the NLM employs a vocabulary of controlled terminology, the Medical Subject Headings (MESH) [18]. MESH terms are used as keywords for archiving, storing and later locating the MEDLINE documents that correspond to certain keywords. The task of assigning MESH terms to new citations is labor-intensive, and the development and evaluation of automated approaches to assist with this task have been the subject of intensive research.
Much of this research [19] has been conducted under the auspices of the NLM's Indexing Initiative, which has produced the Medical Text Indexer (MTI) system. The MTI automatically generates ordered lists of MESH suggestions and is currently used by human curators at the NLM as an assistive tool.

The goal of this work is to investigate the methods used by the MTI tool to annotate documents for inclusion in the public database MEDLINE. MTI is composed of modules that implement different annotation techniques, with particular emphasis on the recognition of MESH terms. Therefore, in this paper we study the architecture of the MTI and evaluate the parameterization options of this tool for obtaining the best MESH candidates. Using controlled vocabularies such as MESH increases precision and recall in searching by identifying equivalent terms.

2 Indexing

Indexing is the task of assigning to a document a limited number of terms denoting concepts that are substantively discussed in the document. This type of indexing is useful for retrieval purposes, but it also has a strong semantic descriptive value, in that the set of terms chosen to describe a document serves as a synopsis of the subject matter discussed in it. Indexing is a crucial step in any information retrieval system. In this paper, we focus on the particular controlled indexing task of assigning indexing terms from the MESH thesaurus to the biomedical texts referenced in MEDLINE, also known as citations. For this database it has been observed that adding the MESH terms to the text improves retrieval performance [12].

Human indexing is an expensive and intensive activity that consists of reviewing the complete text of an article and assigning MESH terms to it in the following way: (1) select main headings (or descriptors) to represent concepts that are substantively discussed in the article (approximately twelve main headings are selected, but the number may vary depending upon the article's length and complexity); (2) attach the appropriate subheadings to the main headings selected (subheadings, also known as qualifiers, afford a convenient means of grouping together those citations which are concerned with a particular aspect of a subject); (3) mark the most substantively discussed concepts as major (marked with *); and (4) make sure appropriate checktags are selected, all the while (5) complying with the instructions detailed in the indexing manual.

As more and more documents become available in electronic form, and as more organizations develop digital libraries for their collections, the exploration of automated indexing techniques becomes both feasible and necessary to continue to provide adequate access to information.
2.1 Medical Text Indexer

The Indexing Initiative System (IIS) at the NLM was begun several years ago to address the indexing problem by exploring semi-automated indexing methods, with the ultimate goal of improving access to bibliographic information and providing a number of methods for automatically computing MESH terms that could be added to a document prior to standard MESH indexing. Some information retrieval experiments have shown that MTI's indexing produces retrieval results that are almost as good as those produced by human indexing [13]. Several promising techniques were studied and formed into a prototype indexing system which eventually became the Medical Text Indexer (MTI).

The MTI system consists of software for applying alternative methods of discovering MESH headings and then combining them into an ordered list of recommended indexing terms, as shown in Fig. 1. The top portion of the diagram consists of three paths, or methods, for creating a list of recommended indexing terms: MetaMap Indexing, Trigram Phrase Matching, and PubMed Related Citations. The two left paths actually compute Unified Medical Language System (UMLS) Metathesaurus concepts [6], which are passed to the
Restrict to MESH method. Results from each path are weighted and combined using the Clustering method. The system is highly parameterized, not only by path weights but also by several internal parameters specific to the Restrict to MESH and Clustering methods. Next we briefly explain each component.

Fig. 1. MTI structure and principal components (partially obtained from [13]): MetaMap, Trigrams, PubMed Related Citations, Restrict to MESH, and Clustering and Ranking.

MetaMap Indexing. MMI [1] is a method of discovering UMLS concepts [12]. It consists of applying the MetaMap program [3] to a body of text and then ordering the resulting concepts using a ranking function. The steps are:
1. Parsing. Arbitrary text is parsed into simple noun phrases using the SPECIALIST parser [2].
2. Variant Generation. For each phrase, variants are generated, where a variant consists of one or more consecutive phrase words together with all of its acronyms, abbreviations, synonyms, inflectional variants and meaningful combinations of these.
3. Candidate Retrieval. The candidate set of all UMLS strings containing at least one of the variants is retrieved.
4. Candidate Evaluation. Each UMLS candidate is evaluated against the input text by first computing a mapping from the phrase words to the candidate's words and then calculating the strength of the mapping using a linguistically principled evaluation function consisting of a weighted average of four metrics: centrality (involvement of the head of the input phrase), variation, coverage and cohesiveness. The candidates are ordered according to mapping strength.
5. Mapping Construction. Complete mappings are constructed by combining candidates involved in disjoint parts of the phrase, and the strength of the complete
mappings is computed just as for candidate mappings. The highest-scoring complete mappings represent MetaMap's best interpretation of the original phrase.

Trigram Phrase Matching. This is a method of identifying phrases that have a high probability of being synonyms. It is based on representing each phrase by the set of character trigrams extracted from that phrase. The character trigrams are used as key terms in a representation of the phrase, much as words are used as key terms to represent a document. The similarity of phrases is then computed using the vector cosine similarity measure, according to the following algorithm:
1. Break the title and abstract of a document up into all possible phrases consisting of one to six contiguous words without internal punctuation.
2. For each phrase produced in step 1, compute the similarity score against all phrases in UMLS and record the phrase that obtains the highest score.
3. For each word in the title and abstract, record the phrase containing that word that receives the highest overall score against the UMLS, together with the UMLS phrase that produced that highest score.
4. For each phrase pair obtained in step 3, where one element is a phrase in the document and the other is a phrase in UMLS, count how many times the pair appears in different places in the document and return the pair, their score, and the count.
Like MMI, the Trigram Phrase Matching algorithm produces UMLS concepts, which are subsequently restricted to MESH headings by the Restrict to MESH method.

Restrict to MESH. This method is based on the observation that the representation of meaning in the UMLS is organized according to the principle of semantic locality [11], in which several means of representing relationships between concepts produce a cluster of semantically related terms.
In the Indexing Initiative, three of these phenomena are used to find the MESH terms most closely related to any given UMLS concept: synonyms, interconcept relationships, and categorization [8]. The overall strategy for restricting a given UMLS term to the semantically closest MESH term involves the following four steps:
1. Choose a MESH term as a synonym of the source concept.
2. Choose an associated expression which is a translation of the source concept.
3. Select MESH terms from concepts hierarchically related to the source concept.
4. Base the selection on the non-hierarchically related concepts of the source concept.
The algorithm stops at the first step that succeeds.

PubMed Related Citations (RC). This method [20] directly computes a ranked list of MESH headings based on a given title and abstract. The neighbours of a pending document (its related citations) are those documents in the database that are the most similar to it. The similarity between documents is measured using the words in their titles and abstracts. Stopwords are eliminated from processing and a limited amount of stemming is done, but no thesaurus is used. Having obtained the set of terms that represent each document, the next step is to assign global and local weights to each term. The global weight is used in weighting the term throughout the
database. The global weight of a term is greater for the less frequent terms. The local weight is log(n+1), where n is the number of times the term occurs in a document. The weight of the term is the product of the two weights. The similarity of two documents is computed using the term weights defined above and is an example of vector cosine similarity. Recommended index terms are extracted from the MESH fields of the documents most similar to a given document.

Clustering and Ranking. This task [12] produces a single list of recommended MESH terms by combining the recommendations of the methods described above. It computes a rank score for each suggested indexing term using term weights, co-occurrence information, and estimates of the importance of the term based on where and how the term arose. The result of the clustering process constitutes the output of the MTI system. The Clustering and Ranking task provides a weighting of the confidence, or strength of belief, in the assignment and ranks the suggested headings accordingly. A number of factors can be recognized as playing a role in that confidence:
The path: a weight can be assigned to the overall method of finding the MESH term (PathWeight).
The goodness of the match: how much confidence there is in the way the method found the MESH term. The goodness of the match depends on the method used to find the heading. Each time a MESH heading is suggested, a weighting can be given to that suggestion. This is accomplished using both a MapScore and a NavScore. The MapScore reflects the confidence in the mapping to a UMLS term, and the NavScore the confidence in navigating from a UMLS term to a MESH heading. The possibilities are:
A phrase identified in a text is an exact match to a MESH term. Equivalently, it might have been a match to a UMLS term that was a synonym of a MESH term.
Of lesser significance is an exact match to a UMLS term that is then mapped to a MESH heading using the Restrict to MESH method.
Another possibility is that the phrase is an inexact, or approximate, match to a UMLS term, which is either a synonym of a MESH heading or mapped to MESH.
The location in the text of the nominal phrase that led to the suggestion: if the heading was suggested by a phrase occurring in the title, it should be given more weight, because concepts mentioned in the title of an article are probably more important than other concepts mentioned in the article.
The semantic consistency: semantic consistency can be identified by the relationships that a suggested heading has with another one. These relationships might be either occurrence in the same hierarchy (as parents or siblings) or known co-occurrence of the headings in MEDLINE. This latter evidence needs to be weighted according to a normalized frequency of the co-occurrence, which is explained below.

The four steps involved in this clustering and ranking process are:

1. Creating the Normalized Frequency Scores for the Co-Occurring Concepts: MTI creates a co-occurring concepts normalized frequency database using the UMLS
Metathesaurus. Co-occurrences are concepts that occur together in the same "entries" in some information source (e.g., MEDLINE). Co-occurrence relationships may exist between similar concepts (e.g., "Atrial Fibrillation" and "Arrhythmia"), between very different concepts that nevertheless have some important connection in the field of biomedicine (e.g., "Atrial Fibrillation" and "Digoxin"), or between a primary concept and a qualifier (e.g., "Lithotripsy" and "instrumentation"). A co-occurrence relationship can exist between two concepts that have no other apparent relationship, although the frequency of such co-occurrences will be small. From MEDLINE, co-occurrence was computed for concepts that were designated as principal or main points in the same article; i.e., the co-occurrence counts do not include articles in which either or both of the concepts were present and indexed in MEDLINE but not designated as main points. (A concept is considered to be a main point if the * is attached to the main heading or any of its subheadings.) The following steps calculate the normalized frequency score for co-occurring concepts:

1.1 Summarize all of the records by combining identical pairings of CUI (Concept Unique Identifier of UMLS) counts. See the example in Table 1.

Table 1. Example of clustering and ranking creating Normalized Frequency (step 1.1)
Columns: CUI 1 | CUI 2 | COF (Co-occurrence factor)

1.2 Determine total frequency counts for each CUI1 (Table 2): this is done by summing the COF for each CUI1 and CUI2 combination, which provides a total frequency count for each CUI1 and CUI2 pairing.

Table 2. Example of clustering and ranking creating Normalized Frequency (step 1.2)
Columns: CUI 1 | CUI 2 | COF (Co-occurrence factor)

1.3 Create a temporary file containing a single line for each unique CUI1 concept (Table 3). This line contains the total frequency count for that particular CUI1.

Table 3. Example of clustering and ranking creating Normalized Frequency (step 1.3)
Columns: CUI 2 | Total frequency

1.4 Combine 1.2 and 1.3 in a file containing all of the records of 1.2, with the total frequency count from 1.3 appended to the end of each line (see Table 4).

Table 4. Example of clustering and ranking creating Normalized Frequency (step 1.4)
Columns: CUI 1 | CUI 2 | COF (Co-occurrence factor) | Frequency

1.5 Normalize the frequency counts for each of the records by dividing the individual record's frequency count (field 3) by the CUI1's total frequency count (field 4). See the example in Table 5, where the total frequency count of the CUI1 is 1190.

Table 5. Example of clustering and ranking creating Normalized Frequency (step 1.5)
Columns: CUI 1 | CUI 2 | Frequency normalized (each entry has the form COF/1190)

2. Load and summarize individual path results, calculating the term weights. The TermWeight for each MESH heading is the summation of all entries for a MESH term (identified by MH) from each of the paths used (MetaMap and PubMed Related Citations). The TermWeight for each MH, regardless of path, is calculated using Eq. 1, where i represents a single occurrence of the suggestion of one MESH heading:

$$\mathrm{TermWeight} = TW = \sum_{i=1}^{n} \left( \mathrm{PathWeight}_i \cdot \mathrm{MapScore}_i \cdot \mathrm{NavScore}_i \right) \qquad (1)$$

The following steps are done for each MESH heading to calculate the TermWeight:

2.1 Each individual path provides the weight of the item along with its navigational string information. The example in Table 6 shows the items returned for the concept Blood Flow Velocity via both the MMI and RC pathways.

Table 6. Example of clustering phase calculating term weights (step 2.1)
Path | CUI | Individual MapScore | Navigational string | Concept Name
MMI | | 118 | G/P | Blood Flow Velocity
MMI | | 118 | O | Blood Flow Velocity
RC | | | NIM | Blood Flow Velocity
RC | | | NIM | Blood Flow Velocity

In the first line we have an item coming from the MMI pathway with a MapScore of 118 out of a possible perfect score of 1000 and a navigational string of G/P (Parent/Broader) [see the Parameter tuning section]. In the third line we have an item coming from the RC pathway, whose MapScore is out of a possible perfect score of 255, with a navigational string of NIM (MESH Heading). The perfect score is 1000 if the path is MMI or Trigram, and 255 if it is RC.

2.2 The MMI items are loaded into the program before all the RC terms. To calculate the PathWeight to be used in the calculations for each item, the individual path weight (assigned by the user) is divided by the path-scoring factor (1000 for MMI or Trigram, and 255 for RC) (see Table 7). The path-scoring factor is used to equalize the different scoring methods.

Table 7. Example of clustering phase calculating term weights (step 2.2)
Path | User value | Path-scoring | PathWeight
MMI | | 1000 | 0.0070
RC | | 255 | 0.0078

2.3 Calculate the individual item weights as (PathWeight * MapScore * NavScore), where NavScore depends on the navigational string [see the Parameter tuning section] (see Table 8).

Table 8. Example of clustering phase calculating term weights (step 2.3)
Path | PathWeight | Individual MapScore | Navigational string¹ | Total
MMI | 0.0070 | 118 | G/P (0.90) | 118*0.0070*0.90
MMI | 0.0070 | 118 | O (0.50) | 118*0.0070*0.50
RC | 0.0078 | | NIM (0.80) | MapScore*0.0078*0.80
RC | 0.0078 | | NIM (0.80) | MapScore*0.0078*0.80

2.4 Sum all of the individual item weights to get the final TermWeight, here for the Blood Flow Velocity example. The different path entries are summarized into a single term containing the concept name, CUI, score (which is zero at this point and will be calculated in the clustering step), and the calculated TermWeight (see Table 9).

¹ Navigational strings are explained in Section 2.3, Parameter tuning.
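The arithmetic of steps 2.2 to 2.4 and Eq. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MTI's actual code: the `Suggestion` structure and function names are hypothetical, the user values 7.0 and 2.0 are chosen only because they reproduce the rounded PathWeights 0.0070 and 0.0078 of the example, and the RC MapScore of 100 is a made-up placeholder.

```python
from dataclasses import dataclass

# Path-scoring factors quoted in the text: best possible MapScore per path.
PATH_SCORING_FACTOR = {"MMI": 1000, "Trigram": 1000, "RC": 255}

# NavScore values as they appear in the worked example (Table 8).
NAV_SCORE = {"G/P": 0.90, "O": 0.50, "NIM": 0.80}


@dataclass
class Suggestion:
    """One suggestion of a MESH heading by one path (hypothetical structure)."""
    path: str          # "MMI", "Trigram" or "RC"
    map_score: float   # individual MapScore reported by the path
    nav_string: str    # navigational string, e.g. "G/P"


def path_weight(path: str, user_value: float) -> float:
    """Step 2.2: user-assigned weight divided by the path-scoring factor."""
    return user_value / PATH_SCORING_FACTOR[path]


def term_weight(suggestions, user_values) -> float:
    """Eq. 1: sum of PathWeight * MapScore * NavScore over all suggestions."""
    return sum(
        path_weight(s.path, user_values[s.path])
        * s.map_score
        * NAV_SCORE[s.nav_string]
        for s in suggestions
    )


# Assumed user values (not given in the text); see the lead-in above.
user_values = {"MMI": 7.0, "RC": 2.0}

suggestions = [
    Suggestion("MMI", 118, "G/P"),
    Suggestion("MMI", 118, "O"),
    Suggestion("RC", 100, "NIM"),  # hypothetical MapScore
]

tw = term_weight(suggestions, user_values)
print(f"TermWeight = {tw:.4f}")
```

Dividing by the path-scoring factor puts MapScores from different paths on a comparable scale before they are summed, which is the equalizing role the text attributes to that factor.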

Table 9. Example of clustering phase calculating term weights (step 2.4)
Concept name | CUI | Score | TermWeight
Blood Flow Velocity | | |

2.5 The summarized list for all processed items is stored in a file called mt_table, as follows:

Table 10. Example of clustering phase calculating term weights (step 2.5)
Entry | Concept name | CUI | Score | TermWeight
mt_table[0] | DNA-Binding Proteins | | |
mt_table[1] | Transcription Factors | | |
mt_table[2] | SEF1 protein | | |
mt_table[3] | Blood Circulation Time | | |
mt_table[4] | Blood Flow Velocity | | |
mt_table[88] | Regression Analysis | | |

3. Clustering of the results, determining which of the results are related. In the clustering phase, every item in the summarized term-weighted list is checked against the others, looking for items in the list that co-occur with it or are related to it via the MESH tree structure. The results of the clustering process are divided into co-occurring terms (COT) and MESH tree relationship terms. The MESH tree relationships are in turn divided into Parent, Child, or Sibling (PAR/CHD/SIB), called treerel, and Broader, Narrower, or Other (RB/RN/RO), called othrel.

4. Ranking the results, using the information obtained in 1 and 2 to compute the rank of each item. This is the final stage, where a final RankScore is calculated for each item based on the TermWeight, the normalized frequency count, and user-specified constants for COT, REL, Title, and PathWeight. The formula for the RankScore is shown in Eq. 2:

$$\mathrm{RankScore} = TW \cdot \left[ F \cdot \left[ 1 + \sum_{j} \left( COT_j \cdot TW_j \right) + \sum_{k} \left( REL \cdot TW_k \right) \right] \right] \qquad (2)$$

where j ranges over the co-occurring terms, k ranges over the related terms (see Table 15) and F is the Path Factor (see Table 16).

2.2 Filtering

The MTI system has three selectable levels of filtering to help remove inappropriate recommendations before they are presented to a user or returned to a program.

1. Base filtering: the base layer of filtering is a collection of four rules that govern (1) the addition and (2) the removal of MESH headings, check tags, or subheadings
based on recommended terms from the two pathways, (3) the boosting of certain MESH headings based on the recommended terms from the two pathways, and (4) the substitution of subheadings for certain MESH headings. Base filtering provides a mixed list of good and bad recommendations, with a fair number of good recommendations near the top of the list.

2. Medium filtering: the MetaMap (MM) method tends to provide more general terms, and the very nature of the PubMed Related Citations (RC) method tends to provide a small number of spurious terms that are not related to the article being indexed. Medium filtering uses a sequence of ten heuristics to balance the results from the MM and RC methods so that the terms from each reinforce the other. It uses the general terms from the MM method to remove spurious RC terms: an RC term is kept only if at least one more general term from the MM method supports it; otherwise it is removed. Medium filtering then removes any more general MM term when a more specific RC term is found. The specificity of the terms is usually determined using the MESH tree hierarchy, but for longer terms it may also be determined by terms being substrings of one another. This balancing of the results from the two methods allows medium filtering to filter out the more general terms and also to reduce the number of unrelated terms. Medium filtering provides a good-sized list with mostly correct recommendations.

3. Strict filtering: strict filtering is very simple. If a term was not recommended by both the MetaMap and PubMed Related Citations pathways, the term is removed. This filtering provides very high precision at the expense of ignoring good terms which were recommended by only one of the pathways. In the extreme case, no results are returned when the RC pathway finds no related articles. Strict filtering is not currently used in any NLM indexing environment.
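Of the three levels, strict filtering is simple enough to sketch directly. The following Python fragment is a minimal illustration, not MTI's implementation; the function name and the assignment of example terms to pathways are hypothetical.

```python
def strict_filter(ranked_terms, mm_terms, rc_terms):
    """Keep only terms recommended by BOTH pathways, preserving rank order.

    ranked_terms: recommended MESH terms in ranking order
    mm_terms:     terms suggested by the MetaMap pathway
    rc_terms:     terms suggested by the PubMed Related Citations pathway
    """
    recommended_by_both = set(mm_terms) & set(rc_terms)
    return [term for term in ranked_terms if term in recommended_by_both]


# Hypothetical pathway outputs for one citation.
ranked = ["Blood Flow Velocity", "Regression Analysis", "Transcription Factors"]
mm = ["Blood Flow Velocity", "Transcription Factors"]
rc = ["Blood Flow Velocity", "Regression Analysis"]

print(strict_filter(ranked, mm, rc))   # only the term found by both paths
print(strict_filter(ranked, mm, []))   # RC found nothing: empty result
```

The second call reproduces the extreme case mentioned above: when the RC pathway returns no related articles, the intersection is empty and every recommendation is discarded.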
Base filtering and medium filtering are appropriate for most needs: base filtering produces better recall, and medium filtering produces better precision. Base filtering is used to assist indexers in indexing MEDLINE, and medium filtering is used to provide fully automated indexing for collections of abstracts.

2.3 Parameter tuning

The overall RankScore can be altered by changing any of the constants (COT, REL, and PathWeight) or by changing the method by which the weight is calculated (NavScore and MapScore). Altering these values allows a number of experiments to be performed to evaluate the robustness of the weighting scheme and to establish reasonable values for the constants. Tables 11 to 19 depict the parameters used in calculating the TermWeight, along with their default values:

Table 11. PathWeight parameters
Abbreviation | Full Name | Notes | Default value | Range of values
MMI | MetaMap Indexing | Path Weight for MetaMap | |
RC | Related Citations | Path Weight for Related Citations | |
T | Trigram | Path Weight for Trigram | |

Table 12. NavScore parameters
Abbreviation | Full Name | Notes | Default value | Range of values
I | Direct Match Navigational String | Relevance scoring for a term identified as having a Direct Match to a MESH Heading. | |
A | ATX (Associated Expression) Navigational String | Relevance scoring for a term identified as having an Associated Expression relationship to a MESH Heading. | |
G/P | Parent/Broader Navigational String | Relevance scoring for a term identified as having a Parent or Broader relationship. | |
G/C | Child/Narrower Navigational String | Relevance scoring for a term identified as having a Child or Narrower relationship. | |
G/S | Sibling Navigational String | Relevance scoring for a term identified as having a Sibling relationship to the MESH Heading. | |
O | Other Related Navigational String | Relevance scoring for a term identified as having an Other Related relationship (not synonymous, narrower or broader) to the MESH Heading. | |

NavScore parameters are related to the level of confidence in the link between a UMLS term and a MESH term. UMLS is organized in three parts: 1) a list of word forms with their lemmas, part-of-speech and morphological information; 2) a Metathesaurus, which assigns a unique string identifier (CUI) to each term and represents relationships between terms; and 3) a semantic network, which groups concepts according to their meaning into semantic types. The relationships in the Metathesaurus are either hierarchical: PAR (parent), CHD (child), RB (broader), RN (narrower); hierarchically related: SIB (sibling); or non-hierarchical, essentially associative: O (other).

Table 13. Related Citations parameters used to calculate NavScore
Abbreviation | Full Name | Notes | Default value | Range of values
IM | MESH Major Topic Navigational String | Relevance scoring for MESH major topic items returned from the Related Citations method. | |
NIM | MESH Heading Navigational String | Relevance scoring for normal MESH items returned from the Related Citations method. | |
NC | Number of citations | Number of related citations to use from PubMed (0 turns off the RC path). | |

Table 14. MapScore parameters
Full Name | Default value | Tunable by user
Best possible score for items returned by the MMI path (MapScore) | 1000 | No
Best possible score for items returned by the RC path (MapScore) | 255 | No
Best possible score for items returned by the Trigram path (MapScore) | 1000 | No

Table 15. RankScore parameters tunable by users
Abbreviation | Full Name | Notes | Default value | Range of values
COT | Co-occurrences factor | Relevance scoring for terms identified as co-occurring with another term. Co-occurrence is identified using the UMLS MRCOC file. | |
REL | Related Term Factor | Relevance scoring for terms identified as being related via the MESH tree structure. This is used during the clustering phase and figures into the overall RankScore for an item. | |
TF | Title Factor | This parameter has been superseded by the Emphasize Titles factor, which is a defined doubling of the score for items found in the Title field of the citation. This emphasis is applied after ranking and clustering. | Not currently used |

For each pair of MESH terms, the frequency of co-occurrence in MEDLINE citations is recorded in the UMLS and can be used as a surrogate for the strength of the relationship. Co-occurrences are therefore an important source of knowledge that has the potential to complement the limited set of symbolic relationships, and they would benefit from a characterization of their semantics to be fully usable.
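To make the role of the COT and REL constants concrete, the RankScore of Eq. 2 can be sketched as follows. This is an illustrative reading of the formula, not MTI's code: the function and argument names are hypothetical, the per-term factor COT_j is simply taken as an input (how it is derived from the COT constant and the normalized co-occurrence frequency is not spelled out here), and all numeric values are made up.

```python
def rank_score(tw, path_factor, cooccurring, related, rel):
    """Eq. 2: RankScore = TW * [F * [1 + sum_j(COT_j * TW_j) + sum_k(REL * TW_k)]].

    tw:          TermWeight of the item being ranked
    path_factor: F, which the text gives as 2 if the item comes from MetaMap
                 or Trigrams and also from Related Citations, otherwise 1
    cooccurring: list of (cot_j, tw_j) pairs, one per co-occurring term
    related:     list of tw_k values, one per term related via the MESH tree
    rel:         the REL constant
    """
    cot_sum = sum(cot_j * tw_j for cot_j, tw_j in cooccurring)
    rel_sum = sum(rel * tw_k for tw_k in related)
    return tw * (path_factor * (1 + cot_sum + rel_sum))


# Hypothetical values: one co-occurring term and one tree-related term.
score = rank_score(
    tw=2.0,
    path_factor=2,
    cooccurring=[(0.5, 1.0)],
    related=[3.0],
    rel=0.1,
)
print(f"RankScore = {score:.2f}")
```

With no co-occurring or related terms both sums vanish and the RankScore reduces to TW * F, so an isolated term is ranked on its TermWeight and Path Factor alone.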

Table 16. RankScore parameters not tunable by users

| Full Name | Default value | Tunable by user |
|---|---|---|
| TW: Term Weight | - | No |
| F: Path Factor (if the item comes from MetaMap or Trigrams AND also from PubMed Related Citations, F = 2; otherwise F = 1) | 1 or 2 | No |

Table 17. Filtering level parameters

| Full Name | Notes | Default value |
|---|---|---|
| Medium Filtering | Remove items from the list of recommendations based on specific heuristics. | |
| Strict Filtering | Remove all items from the list of recommendations that are not recommended by both MetaMap and PubMed Related Citations. | |
| Base Filtering | Basic processing based on the default values for all options. | |

Table 18. Post-processing parameters

| Full Name | Notes | Default value |
|---|---|---|
| Star MESH that come from Title | Add * to each MESH term that was identified from the Title. | |
| Add CheckTags | Add from a list of CheckTags based on review of actual text and the list of CheckTags. | |
| Add Geographics | Add from a list of Geographic Locations based on review of actual text and the list of Geographics. | |
| Remove Do Not Index With Terms | Remove MESH Terms which have been indicated as Do Not Index With from our list of recommendations and prior to scoring. | |
| Show Headings Mapped to (HM) | Display MESH Headings that are in fact Headings Mapped to with a HM notation versus normal MESH Headings MH notation. | |
| Show Entry Terms (ET) | Replace MESH Headings with their corresponding Entry Term where applicable. | |
| Show Treecodes | In the detailed outputs, add in the treecodes for each result if we have them. | |
| Show Term Unique Identifiers | Normally only used in II overnight DCMS processing. | |
| Perform Aged/Human Review | Make sure we don't add age related CheckTags if we already have the CheckTag Animals set and Humans not set. If Animals is not set, and we have age related CheckTags recommended, we need to add Humans. Age related CheckTags include: Adolescent, Adult, Aged, Child, Infant and Infant Newborn. If Animals is set, we will remove any of these age related CheckTags. | |
| Bypass Related Citations Results Exclusion | Do not process the results obtained via the PubMed Related Citations through our MH_exclude list. | |
| Limit Recommendations via Publication Types | Reduce the number of recommendations from the default when a citation is identified by specific Publication Types. Currently this is set as follows: PT equals WReview or News, we limit the number of returned terms to 14. If the PT equals Editorial we limit to 9, and if the PT equals Letter, we limit to 8. | |
| Limit Recommendations for Title Only Citations | Reduce the number of recommendations from the default when a citation only has a title field and no abstract. This is currently calculated based on the number of words in the title: 0-2 words limits the number to 7, 3 or 4 limit to 12, limit to 13, limit to 14, and anything larger than 21 words in the title is limited to 13 items. | |
| Rank Score Filtering for Title Only Citations | If this is a title only abstract/citation AND the term is ranked 11 or below on the list of recommendations AND if the score is less than 190, we will stop the list. | |
| Rank Score Filtering for Title & Abstract Citations | If this is a title AND abstract citation AND the term is ranked 14 or below on the list of recommendations AND if the score is less than 203, we will stop the list. | |
| Use Latest Supplemental Concepts | Every Monday morning the MESH Vocabulary is updated. This usually only involves the Supplement Concepts. This option says to use this updated lookup list and apply any relevant changes. | |
| Show MESH DUIs | In the detailed output, add in the MESH Unique Identifier for each result if we have it. | |
| Use Word Sense Disambiguation (WSD) | This option turns on the WSD option for the MetaMap path to MTI. MetaMap uses WSD to limit ambiguous UMLS Concepts it finds in the text being processed. | |

Table 19. Output options

| Full Name | Notes | Default value |
|---|---|---|
| Simple | Simple display with only the names of the MESH Headings, CheckTags, and SubHeadings being displayed in scoring order and with annotations. | |
| Detailed | Detailed display showing all relevant information about all of the top N recommendations. This includes: name of the item, CUI, final score, type, where the item was found in the text, and who recommended the term. In the case of CheckTags and SubHeadings, the field after the type (CT/SH) contains the triggering information that caused this item to be included in the recommendations. Recommendations are displayed in scoring order. | |
| Expanded Detail | The fields are the same as Detailed above except here we add in the Text Trigger field, which gives us a mapping of concepts to actual triggering text within the document. | |
| Full Listing with Detailed | The Full Listing format is similar to the Detailed format outlined above. The differences are that the Full Listing shows the entire list and includes a number showing the list position for each recommendation. | |
| Just The Facts | The fields are the same as Detailed above except here we limit to just the first four fields: PMID, Term, CUI, Score. | |
| DCMS List | The DCMS List output format is a single line showing the PMID followed by zero or more recommended MESH Terms and their associated data type. | |
| Show NO_TERMS List | This is the same as the DCMS List above except if we have zero recommendations for a given PMID, we will print NO_TERMS. | |
| XML | In the XML output format, we enclose all the terms with XML tags. | |

3 Experimental context

The high degree of parameterization of the MTI allows us to test the components for their relative contribution to the results. It is possible, for example, to compare the same method using different parameter settings or the same settings across different methods. Such experiments were performed to determine optimal system parameterization values using a randomly selected sample of 1000 MEDLINE citations. This test corpus was obtained by searching PubMed with the search limited to the last 1000 items added between January and April. The results were exported as MEDLINE format records. This format includes the MESH terms assigned by NLM indexers, against which we compare our results.

Each experiment consists of processing the citations with a given set of parameters. The recommended indexing is then compared with the terms assigned by NLM indexers. As reported by Lancaster [4], it is difficult to adequately evaluate the quality of indexing because, even in the case of controlled indexing, there is no unique correct indexing set to use as a reference. However, we used the existing MEDLINE indexing as the gold standard indexing for a citation.
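As an illustration of how the reference indexing could be read from the exported records, the following minimal sketch collects the MH (MESH heading) fields per PMID from MEDLINE format text. It is not part of MTI: the function name `parse_medline_mesh` and the sample records are invented for this example, and it assumes that each MH value fits on a single line, that subheadings follow a "/" and that a leading "*" marks a major topic.

```python
from collections import defaultdict


def parse_medline_mesh(text):
    """Map each PMID to the set of MESH headings found in its MH fields.

    Hypothetical helper: subheadings (after "/") and the major-topic
    marker "*" are dropped, since subheadings are not evaluated here.
    """
    gold = defaultdict(set)
    pmid = None
    for line in text.splitlines():
        # MEDLINE format: a tag padded to 4 characters, then "- ", then the value.
        tag, sep, value = line[:4].strip(), line[4:6], line[6:].strip()
        if sep != "- ":
            continue  # blank line or continuation of a wrapped field
        if tag == "PMID":
            pmid = value
            gold[pmid]  # register the citation even if it has no MH field
        elif tag == "MH" and pmid is not None:
            heading = value.split("/")[0].lstrip("*").strip()
            gold[pmid].add(heading)
    return dict(gold)


# Invented records, used only to exercise the sketch.
SAMPLE = """\
PMID- 00000001
TI  - An invented title used only for this sketch.
MH  - Abstracting and Indexing/*methods
MH  - Humans
MH  - *Medical Subject Headings

PMID- 00000002
TI  - Another invented record.
MH  - MEDLINE
"""

if __name__ == "__main__":
    for pmid, headings in parse_medline_mesh(SAMPLE).items():
        print(pmid, sorted(headings))
```

The resulting sets of headings per citation can then serve as the reference against which recommended terms are compared.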
Throughout the study, precision, recall and F-measure are used to perform quantitative evaluations of the results. Recall is the number of recommended pairs that were also in the MEDLINE indexing divided by the total number of correct pairs according to the MEDLINE indexing. Precision is the number of recommended pairs that were also in the MEDLINE indexing divided by the total number of pairs recommended. F-measure (or balanced F-score) is the harmonic mean of precision and recall. It is computed as shown in Eq. 3, where P is precision and R is recall:

F-measure = (β × P × R) / (P + R)    (3)

We selected this measure because the β=2 version of the F-measure gives recall twice the weight of precision. This corresponds to the observation that indexers will tolerate some inappropriate terms as long as many useful terms are presented to them. This weighting also ameliorates the handicap of always recommending 25 terms when we know that the normal number of MESH terms assigned is closer to 12.

Recall, precision and F-measure are calculated for each citation, and the average, the median and the standard deviation over all the citations in an experiment are reported. The average is strongly influenced by atypical values (non-homogeneous data), whereas the median is not, so both measures are used. The standard deviation is a statistic that tells us how tightly the values in a data set are clustered around the mean. Summary results of an evaluation performed in 2007 by the MTI team using 200 MEDLINE documents can be consulted in [14].

3.1 Results

All experiments are performed using the same values for filtering, post-processing (Table 20) and output options (Table 21), but we have modified the parameters involved in the calculation of PathWeight, and of the RankScore and NavScore used in the MTI clustering phase.

Table 20. Fixed options for post-processing

| Full Name | Default value |
|---|---|
| Star MESH that come from Title | |
| Add CheckTags | |
| Add Geographics | |
| Remove Do Not Index With Terms | |
| Show Headings Mapped to (HM) | |
| Show Entry Terms (ET) | |
| Show Treecodes | |
| Show Term Unique Identifiers | |
| Perform Aged/Human Review | |
| Bypass Related Citations Results Exclusion | |
| Limit Recommendations via Publication Types | |
| Limit Recommendations for Title Only Citations | |
| Rank Score Filtering for Title Only Citations | |
| Rank Score Filtering for Title & Abstract Citations | |
| Use Latest Supplemental Concepts | |
| Show MESH DUIs | |
| Use Word Sense Disambiguation (WSD) | |

Table 21. Fixed options for output

| Full Name | Default value |
|---|---|
| Simple | |
| Detailed | |

Table 21 (continued)

| Full Name | Default value |
|---|---|
| Expanded Detail | |
| Full Listing with Detailed | |
| Just The Facts | |
| DCMS List | |
| Show NO_TERMS List | |
| XML | |

These experiments are based on the top 25 MTI recommendations plus CheckTags, without subheadings. In the first experiment, the parameters involved in calculating NavWeight were adjusted. These parameters are: I (direct match), A (associated expression), G/P (parent/broader), G/C (child/narrower), G/S (sibling) and O (other relations). To find the best value for each parameter (independently of the others), each one was tested with the following weights: 0.25, 0.50, 0.75 and 1.00. Table 22 shows the average, median, variance and standard deviation of the F-measure over the 1000 values obtained.

Table 22. First approximation of parameterization of NavWeight

| Parameter | Weight | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|---|
| I (Direct Match) | 0.25 | 0.3483 | 0.3479 | 0.0231 | |
| I (Direct Match) | 0.50 | 0.3406 | 0.3333 | 0.0238 | |
| I (Direct Match) | 0.75 | 0.3428 | 0.3333 | 0.0238 | |
| I (Direct Match) | 1.00 | 0.3456 | 0.3438 | 0.0243 | 0.1559 |
| A (Associated Expression) | 0.25 | 0.3456 | 0.3478 | 0.0242 | |
| A (Associated Expression) | 0.50 | 0.3406 | 0.3333 | 0.0237 | |
| A (Associated Expression) | 0.75 | 0.3453 | 0.3448 | 0.0239 | |
| A (Associated Expression) | 1.00 | 0.3502 | 0.3478 | 0.0235 | 0.1533 |
| G/P (Parent/Broader) | 0.25 | 0.3497 | 0.3479 | 0.0237 | |
| G/P (Parent/Broader) | 0.50 | 0.3413 | 0.3333 | 0.0243 | |
| G/P (Parent/Broader) | 0.75 | 0.3453 | 0.3448 | 0.024 | |
| G/P (Parent/Broader) | 1.00 | 0.3486 | 0.3479 | 0.0242 | 0.1556 |
| G/C (Child/Narrower) | 0.25 | 0.3519 | 0.3529 | 0.0241 | |
| G/C (Child/Narrower) | 0.50 | 0.3433 | 0.3428 | 0.0244 | |
| G/C (Child/Narrower) | 0.75 | 0.3468 | 0.3448 | 0.0241 | |
| G/C (Child/Narrower) | 1.00 | 0.3489 | 0.3478 | 0.0247 | 0.1572 |
| G/S (Sibling) | 0.25 | 0.347 | 0.3478 | 0.0244 | |
| G/S (Sibling) | 0.50 | 0.3491 | 0.35 | 0.0241 | |
| G/S (Sibling) | 0.75 | 0.3391 | 0.3333 | 0.0252 | |
| G/S (Sibling) | 1.00 | 0.3466 | 0.3478 | 0.0243 | 0.1559 |
| O (Other relations) | 0.25 | 0.3397 | 0.3333 | 0.0236 | 0.1536 |
| O (Other relations) | 0.50 | 0.3459 | 0.3439 | 0.0238 | 0.1543 |

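Table 22 (continued)

| Parameter | Weight | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|---|
| O (Other relations) | 0.75 | 0.3421 | 0.3448 | 0.0241 | |
| O (Other relations) | 1.00 | 0.3433 | 0.3448 | 0.0249 | 0.1578 |

In Table 23, the best results for each parameter are combined and their efficiency as a whole is checked by comparing them with the default values. In the case of parameter O, we test with O=1.00 (Option 1) and O=0.75 (Option 2) because both have the same median value, though the 0.75 value has a lower standard deviation.

Table 23. Summary table combining the best values for I, A, G/P, G/C, G/S and O parameters

| Parameter values | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|
| Option1 (I=0.25, A=1.00, G/P=0.25, G/C=0.25, G/S=0.50, O=1.00) | 0.3845 | 0.3809 | 0.0218 | 0.1476 |
| Option2 (I=0.25, A=1.00, G/P=0.25, G/C=0.25, G/S=0.50, O=0.75) | 0.3484 | 0.3479 | 0.0242 | 0.1556 |
| Default (I=1.00, A=1.00, G/P=0.90, G/C=0.75, G/S=0.70, O=0.50) | 0.3424 | 0.3448 | 0.0243 | 0.1559 |

As observed in Table 23, this new parameterization improves on MTI's default settings. In a second attempt to get better results using the NavWeight parameters, we assign values based on the best value of the predecessor parameter, trying four values: 0.25, 0.50, 0.75 and 1.00 (see Table 24). As before, in Table 25 the best results for each parameter are combined and compared with the default values.

Table 24. Second approximation of parameterization of NavWeight

| Parameter | Weight | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|---|
| I (Direct Match) using other params equal to 0 | 0.25 | 0.3483 | 0.3479 | 0.0231 | |
| I (Direct Match) using other params equal to 0 | 0.50 | 0.3406 | 0.3333 | 0.0238 | |
| I (Direct Match) using other params equal to 0 | 0.75 | 0.3428 | 0.3333 | 0.0238 | |
| I (Direct Match) using other params equal to 0 | 1.00 | 0.3456 | 0.3438 | 0.0243 | 0.1559 |
| A (Associated Expression) using I=0.25 | 0.25 | 0.3849 | 0.381 | 0.0219 | |
| A (Associated Expression) using I=0.25 | 0.50 | 0.3854 | 0.381 | 0.0218 | |
| A (Associated Expression) using I=0.25 | 0.75 | 0.3388 | 0.3333 | 0.0228 | |
| A (Associated Expression) using I=0.25 | 1.00 | 0.3545 | 0.3529 | 0.0236 | 0.1536 |
| G/P (Parent/Broader) using I=0.25 and A=0.50 | 0.25 | 0.3415 | 0.3333 | 0.0231 | |
| G/P (Parent/Broader) using I=0.25 and A=0.50 | 0.50 | 0.352 | 0.3529 | 0.0232 | |
| G/P (Parent/Broader) using I=0.25 and A=0.50 | 0.75 | 0.3446 | 0.3428 | 0.0232 | 0.1523 |
| G/P (Parent/Broader) using I=0.25 and A=0.50 | 1.00 | 0.3447 | 0.3334 | 0.0244 | 0.1562 |
| G/C (Child/Narrower) using I=0.25, A=0.50 and G/P=0.50 | 0.25 | 0.3528 | 0.35 | 0.0241 | 0.1552 |
| G/C (Child/Narrower) using I=0.25, A=0.50 and G/P=0.50 | 0.50 | 0.3499 | 0.3479 | 0.0236 | 0.1536 |
| G/C (Child/Narrower) using I=0.25, A=0.50 and G/P=0.50 | 0.75 | 0.341 | 0.3333 | 0.0239 | 0.1546 |
| G/C (Child/Narrower) using I=0.25, A=0.50 and G/P=0.50 | 1.00 | 0.3561 | 0.3529 | 0.0239 | 0.1546 |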

Table 24 (continued)

| Parameter | Weight | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|---|
| G/S (Sibling) using I=0.25, A=0.50, G/P=0.50 and G/C=1.00 | 0.25 | 0.3518 | 0.3479 | 0.0242 | 0.1556 |
| G/S (Sibling) using I=0.25, A=0.50, G/P=0.50 and G/C=1.00 | 0.50 | 0.3474 | 0.3478 | 0.0241 | 0.1552 |
| G/S (Sibling) using I=0.25, A=0.50, G/P=0.50 and G/C=1.00 | 0.75 | 0.3432 | 0.3333 | 0.0238 | 0.1543 |
| G/S (Sibling) using I=0.25, A=0.50, G/P=0.50 and G/C=1.00 | 1.00 | 0.3544 | 0.3529 | 0.0244 | 0.1562 |
| O (Other Relations) using I=0.25, A=0.50, G/P=0.50, G/C=1.00 and G/S=0.25 | 0.25 | 0.355 | 0.3529 | 0.0237 | 0.1539 |
| O (Other Relations) using I=0.25, A=0.50, G/P=0.50, G/C=1.00 and G/S=0.25 | 0.50 | 0.3479 | 0.3448 | 0.0238 | 0.1543 |
| O (Other Relations) using I=0.25, A=0.50, G/P=0.50, G/C=1.00 and G/S=0.25 | 0.75 | 0.342 | 0.3333 | 0.0236 | 0.1536 |
| O (Other Relations) using I=0.25, A=0.50, G/P=0.50, G/C=1.00 and G/S=0.25 | 1.00 | 0.3411 | 0.3333 | 0.0251 | 0.1584 |

Table 25. Comparing second approximation with default values

| Parameter values | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|
| Default (I=1.00, A=1.00, G/P=0.90, G/C=0.75, G/S=0.70, O=0.50) | 0.3424 | 0.3448 | 0.0243 | 0.1559 |
| Option (I=0.25, A=0.50, G/P=0.50, G/C=1.00, G/S=0.25, O=0.25) | 0.355 | 0.3529 | 0.0237 | 0.1539 |

As observed in Table 25, the results of this experiment did not improve on those of the previous experiment (Table 23), but they are better than the results obtained with the default values. In summary, from these early experiments we can conclude that the best values for the parameters used to calculate NavWeight are those obtained in the first experiment (the first row in Table 23).

From these results, and based on the meaning of these parameters, we can also conclude that the semantic value of words in the document is more important than its syntactic structure, i.e. the parameter I (direct match) has a very low weight compared with other parameters such as A (related expressions) or O (other relationships). Based on this, we perform a new test increasing the values of the parameters G/P (parent or broader), G/C (child or narrower) and G/S (sibling or synonymy) to give them greater weight, since we believe that the semantic relations represented by these parameters should be more important.

Table 26. Increasing G/P, G/C and G/S parameter values

| Parameter values | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|
| Option1 (I=0.25, A=1.00, G/P=0.25, G/C=0.25, G/S=0.50, O=1.00) | 0.3845 | 0.3809 | 0.0218 | 0.1476 |
| Option2 (I=0.25, A=1.00, G/P=0.50, G/C=0.50, G/S=0.75, O=1.00) | | | | |
| Option3 (I=0.25, A=1.00, G/P=0.75, G/C=0.75, G/S=1.00, O=1.00) | | | | |

In Table 26 we can see that increasing the weight of the parameters that refer to a semantic relationship between terms succeeded in increasing the F-measure value, giving better results. Therefore, for the NavScore parameters we are going to use the best values (second row in Table 26).

In the third experiment (see Table 27) we try to adjust the parameters involved in calculating RankScore. The parameters are COT (co-occurrences factor) and REL (related terms factor).

Table 27. Parameters COT and REL

| Parameter values | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|
| COT=0, REL=0 | 0.3493 | 0.35 | 0.0241 | 0.1552 |
| COT=32767, REL=0 | 0.3382 | 0.3333 | 0.0252 | 0.1587 |
| COT=0, REL= | 0.3481 | 0.3448 | 0.0241 | 0.1552 |
| COT=10000, REL=0 | 0.3431 | 0.3478 | 0.0244 | 0.1562 |
| COT=0, REL= | 0.3495 | 0.3515 | 0.0244 | 0.1562 |
| COT=10000, REL= | 0.3465 | 0.3448 | 0.0243 | 0.1559 |
| COT=100, REL= | 0.3534 | 0.3529 | 0.0231 | 0.152 |
| COT=10000, REL=100 | 0.3424 | 0.3448 | 0.0243 | 0.1559 |
| COT=10000, REL=32767 | 0.3846 | 0.381 | 0.0218 | 0.1476 |

In Table 27, we can see that the best results are obtained by applying an intermediate value to the COT parameter and the highest value to the REL parameter. These results corroborate the conclusion obtained above, as the REL factor refers to relations between different terms, while the COT factor refers to words that often appear together in a particular context. Once again we can conclude that the semantic relationships between words within a document are important when extracting the terms that best identify it.

Finally, we combine the best results for the NavWeight and RankScore parameters (see Table 28). Results improve when each set of parameters is changed separately, but not when both (NavWeight and RankScore) are combined.

Table 28. Combination of NavWeight and RankScore parameters

| NavWeight parameters | RankScore parameters | Average | Median | Variance | Standard deviation |
|---|---|---|---|---|---|
| Default (I=1.00, A=1.00, G/P=0.90, G/C=0.75, G/S=0.70, O=0.50) | Default (COT=10000, REL=100) | 0.3424 | 0.3448 | 0.0243 | 0.1559 |
| Default (I=1.00, A=1.00, G/P=0.90, G/C=0.75, G/S=0.70, O=0.50) | Test (COT=10000, REL=32767) | 0.3846 | 0.381 | 0.0218 | 0.1476 |
| Option1 (I=0.25, A=1.00, G/P=0.50, G/C=0.50, G/S=0.75, O=1.00) | Test (COT=10000, REL=32767) | | | | |
| Option1 (I=0.25, A=1.00, G/P=0.50, G/C=0.50, G/S=0.75, O=1.00) | Default (COT=10000, REL=100) | | | | |
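The per-citation measures used in this section (precision, recall and F-measure, summarized by average, median and standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the function names are invented, terms are compared as exact strings, and the population standard deviation is used because the text does not say which variant was computed. `f_measure_eq3` follows Eq. 3 as printed, which for β=2 coincides with the harmonic mean; `f_beta` is the more common weighted form, included only for comparison and not taken from this paper.

```python
from statistics import mean, median, pstdev


def precision_recall(recommended, gold):
    """Precision and recall of one citation's recommended terms."""
    recommended, gold = set(recommended), set(gold)
    hits = len(recommended & gold)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def f_measure_eq3(p, r, beta=2.0):
    """Eq. 3 as printed: beta * P * R / (P + R)."""
    return beta * p * r / (p + r) if (p + r) > 0 else 0.0


def f_beta(p, r, beta=2.0):
    """Common weighted form (1 + beta^2) P R / (beta^2 P + R); comparison only."""
    denom = beta * beta * p + r
    return (1 + beta * beta) * p * r / denom if denom > 0 else 0.0


def summarize(scores):
    """Average, median and (population) standard deviation over citations."""
    return {
        "average": mean(scores),
        "median": median(scores),
        "std_dev": pstdev(scores),
    }


if __name__ == "__main__":
    # Invented example: 4 recommended terms, 3 reference terms, 2 in common.
    rec = ["Humans", "MEDLINE", "Abstracting and Indexing", "Neoplasms"]
    ref = ["Humans", "MEDLINE", "Medical Subject Headings"]
    p, r = precision_recall(rec, ref)
    print(p, r, f_measure_eq3(p, r), f_beta(p, r))
    print(summarize([0.2, 0.4, 0.6]))
```

In the invented example, precision is 2/4 and recall is 2/3; the two F variants give different values, which is why the choice of formula matters when comparing results across studies.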

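The two tuning strategies used for the NavWeight parameters (each parameter tested on its own, and each parameter tested with its predecessors fixed at their best values) can be sketched as below. This is a hedged illustration, not the procedure actually coded by the authors: `evaluate` is a hypothetical callback standing in for a full MTI run over the test corpus that returns a single score such as the average F-measure, `baseline` (the values held for the parameters not being varied) is an assumption, and ties are broken in favour of the larger weight.

```python
GRID = (0.25, 0.50, 0.75, 1.00)


def independent_search(evaluate, names, baseline, grid=GRID):
    """Tune each parameter on its own, holding the others at `baseline`."""
    best = {}
    for name in names:
        # Score every candidate weight for this parameter only.
        scored = [(evaluate({**baseline, name: value}), value) for value in grid]
        best[name] = max(scored)[1]
    return best


def sequential_search(evaluate, names, baseline, grid=GRID):
    """Tune parameters in order, fixing each at its best value before the next."""
    current = dict(baseline)
    for name in names:
        scored = [(evaluate({**current, name: value}), value) for value in grid]
        current[name] = max(scored)[1]
    return current


if __name__ == "__main__":
    # Toy objective, invented for the example; a real run would call MTI.
    def toy(params):
        return -((params["I"] - 0.25) ** 2) - ((params["A"] - 1.0) ** 2)

    start = {"I": 0.0, "A": 0.0}
    print(independent_search(toy, ["I", "A"], start))
    print(sequential_search(toy, ["I", "A"], start))
```

With a separable objective such as the toy one, both strategies agree; when parameters interact, as the differences between Tables 23 and 25 suggest, they can select different values.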
4 Conclusions and Future Work

The treatment of semantic relations (synonymous, narrower, broader or related terms or expressions) between terms is essential in information retrieval and therefore in the annotation or indexing of documents. In the biomedical field we can use the Medical Text Indexer (MTI), a tool developed to facilitate the indexing of documents, which provides candidate MESH terms extracted from the text (title and abstract). These candidate terms come from syntactically parsing the sentences of the text, looking in other similar documents, and using a metathesaurus that provides new related terms. Therefore, syntactic relationships can be found between terms, while semantic relationships can be inferred. The identification of terms and their mapping to concepts is the first stage of semantic analysis. Semantic relations between concepts represent another layer of information, which has the potential to make document search even more detailed and specific [9].

MTI is a flexible and highly customizable tool that allows users to assign different levels of importance to the pathways used to extract terms from a document, using parameters that define the weight of different semantic relationships (synonymy, hyponymy, hyperonymy) compared to direct matches of words or other relationships (co-occurrences, related terms, associated expressions). The associated expressions provide a translation of some complex concepts to expressions in other vocabularies [7]. Synonymy and lexical matching techniques are used to link terms together. The identification of new instances of relations was based on observed co-occurrences of concepts using the MESH tree structure.

The experiments described here have improved MTI performance by tuning some of the parameters used in the clustering and ranking phase. We have increased the value of those parameters involved in calculating the ranking of terms based on related expressions, broader, narrower, sibling and other relations, obtaining better F-measure results than with the default values. This reinforces the theory of the importance of semantic relations for indexing a document in the biomedical field and the relevance of MESH terms coming from associated expressions [7].

As future work we can analyze these parameters using full text instead of title and abstract only, and extend our studies to other parameters of the tool. Other future research could include learning semantic relations using classification techniques, where the context features of MESH co-occurrences will be expanded from verbs to other linguistic markers including grammatical functions [9].

References

1. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001;
2. Aronson AR. The effect of textual variation on concept-based information retrieval. In: Cimino JJ, ed. AMIA Annual Fall Symposium. Washington, D.C.: Hanley & Belfus, Inc., 1996:
3. Aronson AR, Browne AC, Rindflesch TC. Exploiting a large thesaurus for information retrieval. RIAO 94. Rockefeller University, New York, N.Y.: JOUVE, Paris, 1994:
4. Aronson AR, Gay CW, Humphrey S, Mork J, Rogers W. The NLM Indexing Initiative's Medical Text Indexer.
5. Aronson AR, Kim W, Wilbur WJ. Automatic MeSH Term Assignment and Quality Assessment. Proc AMIA Symp. 2001;
6. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology, September 27,
7. Bodenreider O, Burgun A. Methods for exploring the semantics of the relationships between co-occurring UMLS concepts.
8. Bodenreider O, Chang HF, Hole WT, Nelson SJ. Beyond Synonymy: Exploiting the UMLS Semantics in Mapping Vocabularies. Proc AMIA Symp 1998;
9. Buitelaar P, Vintar S, Volk M. Semantic Relations in Concept-Based Cross-Language Medical Information Retrieval.
10. Feldman R, Shatkay H. Mining the Biomedical Literature in the Genomic Era: An Overview. 2003.
11. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods of Information in Medicine 1995; 34(1-2):
12. Medical Text Indexer Processing Flow. March 13, Available from: Accessed May 20,
13. Srinivasan P. Optimal document indexing vocabulary for MEDLINE. Information Processing & Management 1996; 32(5):
14. Summary results for 200 MEDLINE Evaluation Analysis (March, 2007). Available from: Accessed May 20,
15. U.S. National Library of Medicine. National Institutes of Health Fact Sheet. Available from: Accessed May 20,
16. U.S. National Library of Medicine. National Institutes of Health. Yearly Citation Totals from 2009 MEDLINE. Available from: Accessed May 20,
17. U.S. National Library of Medicine. National Institutes of Health. Unified Medical Language System (UMLS). Available from Accessed May 20,
18. U.S. National Library of Medicine. NLM Technical Bulletin. Available from: Accessed May 20,
19. Vasuki V, Cohen T. Reflective random indexing for semi-automatic indexing of the biomedical literature. J Biomed Inform. 2010; 43(5):
20. Wilbur WJ. PubMed Related Citations Algorithm. Available at Accessed May 20, 2011.
