Information Retrieval, Information Extraction, and Text Mining Applications for Biology. Slides by Suleyman Cetintas & Luo Si


1 Information Retrieval, Information Extraction, and Text Mining Applications for Biology. Slides by Suleyman Cetintas & Luo Si

2 Outline
- Introduction
- Overview of Literature Data Sources: PubMed, HighWire Press, Google Scholar, Other Sources
- Structure of Biomedical Language
- Biological Terminology
- Lexical and Semantic Sources for Biology
- Biomedical Literature Processing Applications
- Beyond BioCreative: Advanced Applications
- Summary
- References

3 Introduction
- Life-science research produces large and heterogeneous biological data in the form of protein and genomic sequence data, expression profiles, and protein structures.
- Yet a significant amount of information exists in natural language.
  - Most discoveries are communicated in natural language via publications, patents, reports, and e-texts on the web.
  - Controlled vocabulary terms are used for other biological sources: gene product annotations (e.g., Gene Ontology [GO] terms).
  - Database records (e.g., UniProt) contain comments, keywords, descriptions, etc.

4 Introduction
- Structured database entries enable efficient data retrieval, exchange, and analysis.
  - There is a recent tendency to enrich annotation records.
  - General annotation databases such as UniProt (134K citations as of 2008) are of great practical value.
- Yet they are only capable of covering a small fraction of biological context information.
  - They cannot capture the richness of scientific information and argumentation in the literature.
  - They struggle to keep up with the rapid accumulation of new publications.
- Text mining can help link database entries to the evidence and argumentation in the literature.

5 Introduction
- Online literature collections, e.g., PubMed: 70 million queries every month, >20 million publications (as of 2010).
  - Of crucial importance to experimental biologists, biomedical researchers, database curators, etc.
  - Face double-exponential growth rates (due to new journals and an increasing number of journal articles).
- Different needs:
  - The scientific community needs efficient and effective information retrieval for targeted literature searches.
  - The pharmaceutical industry uses text-mining systems for competitive intelligence.
  - Government institutions use such tools to get a global view of the current state of research.

6 Overview of Literature Data Sources
- Several efforts aim to make medical and life-science journal information electronically accessible to the public through the World Wide Web.
- These efforts can be grouped into 3 categories:
  I) Centralized institutional (PubMed) or academic (HighWire Press & Hollis) repositories of peer-reviewed articles or abstracts
  II) Article collection repositories by publishers (e.g., BioMed Central, EMBASE)
  III) Access to indexed scholarly articles (e.g., Google Scholar, Scirus) via web crawlers


8 PubMed
- The most important resource for text mining applications.
- Includes citations (i.e., title, abstract, authors, and source information) supplied by participating publishers.
- Maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).
- Basic search: can be accessed online through Entrez, a text-based search and retrieval system.
  - Entrez improves basic keyword searches by translating the user query to Medical Subject Heading (MeSH) terms.
  - MeSH: controlled vocabulary terms covering the medical domain, chemicals, genes, proteins, etc.

9 PubMed Growth of PubMed citations between

10 PubMed
[Figure: technology development timeline for PubMed (light green) and other biomedical literature search tools (light orange)]

11 PubMed
- Programmatic access: PubMed also offers programmatic access to its content through:
  - Entrez Programming Utilities
  - Open source projects for biologist programmers: BioPerl, Biopython, BioJava, etc.
- NCBI provides the My NCBI service to periodically retrieve new PubMed publications matching a predefined user query.
  - The requester receives a corresponding notification via an alert system.
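As a rough illustration of programmatic access, the sketch below builds a request URL for the ESearch endpoint of the Entrez Programming Utilities. The helper name `build_pubmed_search_url` and the example query are assumptions made for illustration; the sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query: str, max_results: int = 20) -> str:
    """Return an ESearch URL that asks PubMed for IDs matching `query`.

    Hypothetical helper: it only assembles the URL, it sends no request.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": query,          # the user query, as typed into the search box
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example query (illustrative only).
url = build_pubmed_search_url("VRK1 AND c-jun", max_results=5)
print(url)
```

Fetching the URL with any HTTP client should return the matching PubMed identifiers. NCBI asks clients to respect its usage limits, so a real script would likely add rate limiting and an API key.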

12 PubMed
- Local PubMed: it is possible to keep a local relational database of all PubMed citations.
  - Obtain a licensed copy of the whole of PubMed, containing XML-formatted citation records, from NLM/NCBI.
- Mobile access:
  - Txt2MEDLINE: use SMS to access PubMed.
  - PubMed Informer: web-based PubMed monitoring tool that facilitates PDA downloads and RSS feeds.

13 Google Scholar
- An alternative to PubMed.
- Covers not only peer-reviewed articles but also other scholarly texts such as theses, books, and preprint repositories.
- Often returns larger retrieval sets (yet with a substantial number of link-outs to PubMed records).
- Does not offer the advanced search functions that PubMed offers.

14 HighWire Press
- An alternative to PubMed and an initiative of Stanford University; a complementary resource to PubMed.
- Access to peer-reviewed articles, with a search interface to over 1160 journals and 4.8 million full-text articles (over 1.9 million of them available free from HighWire partner publishers).
- Shares many search characteristics with PubMed, though each also has its own features.
- In addition, HighWire:
  - offers a graphical representation of an article's citation map
  - lets the user specify where to conduct the search (title, abstract, etc.)

15 Other resources
- PubMed Central
  - Free access to full-text articles (not only to abstracts).
  - Contains articles published before 1966.
- Publishers have also developed platforms of searchable article repositories, such as EMBASE and BioMed Central, to improve access to their articles.

16 Structure of Biomedical Language
- A collection of homologous protein sequences often shares a common structural fold and tends to exhibit a similar function.
- In natural language, a particular meaning may be expressed using different but largely synonymous expressions.
- Natural language processing (NLP) is used to decode human language by exploiting the regularities and constraints that occur at multiple levels in human language.
- The 4 levels: words, syntax, semantics, pragmatics.

17 Structure of Biomed. Lang.: Words
- Tokenization and morphology: identification of words in biology text.
  - In English, word boundaries are marked by whitespace and sentence boundaries by "." (period or full stop).
  - In practice there are many complications.
  - The JULIE (Jena University Language and Information Engineering) laboratory provides tools for token and sentence boundary detection.
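The complications mentioned on this slide can be seen with a deliberately naive splitter. The sketch below applies only the two rules stated on the slide (whitespace for words, a period for sentences); the function names and the example sentence are made up for illustration and are not part of the JULIE tools.

```python
import re


def naive_sentences(text: str) -> list:
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def naive_tokens(sentence: str) -> list:
    """Split on whitespace only, so punctuation stays glued to words."""
    return sentence.split()


# Invented example with an abbreviation ("Fig.") inside the sentence.
text = "The protein c-jun is activated by VRK1 (Fig. 2). It binds DNA."
sentences = naive_sentences(text)

for s in sentences:
    print(repr(s), naive_tokens(s))
```

The abbreviation "Fig." is wrongly treated as a sentence end, so the two sentences come back as three fragments, and the token "DNA." keeps its trailing period. Dedicated biomedical tokenizers exist to handle cases like these.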

18 Structure of Biomed. Lang.: Words
- Tokenization and morphology: identification of words in biology text.
  - A very important stage for gene mention identification (BioCreative II gene mention task).
    - Some teams explored integrating publicly available gene mention taggers, e.g., the ABNER application or the LingPipe system.
    - A further step links these mentions to specific entries in biological resources (gene normalization).
  - Stemming: convert words to their roots to reduce variability.
    - General stemmer: the Porter stemmer.
    - Specific biomedical stemmers.

19 Structure of Biomed. Lang.: Syntax
- Syntax: the syntax or grammar of a language controls how words are grouped into meaningful phrases.
  - Words can be associated with part-of-speech (POS) tags.
  - POS taggers are based on machine learning algorithms (e.g., hidden Markov models) trained on a manually annotated corpus.
  - The biomedical POS distribution differs slightly from that of general English.
  - Special taggers for the biomedical domain: MedPost tagger, dTagger.
- POS tagging can be useful to:
  - detect textual patterns expressing protein interactions
  - locate gene and protein mentions

20 Structure of Biomed. Lang.: Semantics & Pragmatics
- Semantics: capture the meaning.
  - E.g., "c-jun is activated by VRK1" can be represented as an operator: activate(vrk1, c-jun).
  - The semantic representation abstracts away the syntax.
- Pragmatics: capture the larger context and its contribution to meaning.
  - Text mining systems often rely on sentences as the basic processing unit for extracting associations between biological entities.
  - Descriptions of those relations often go beyond sentence boundaries and make use of referring expressions.
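To make the idea of a semantic representation concrete, the sketch below maps both a passive and an active sentence to the same (predicate, agent, target) triple, mirroring the activate(vrk1, c-jun) example. The two regular expressions and the small verb tables are assumptions for illustration; real relation extraction systems are considerably more elaborate.

```python
import re
from typing import Optional, Tuple

# Hypothetical verb tables: surface form -> predicate name.
PASSIVE = {"activated": "activate", "inhibited": "inhibit"}
ACTIVE = {"activates": "activate", "inhibits": "inhibit"}

# "<target> is <participle> by <agent>" and "<agent> <verb> <target>".
PASSIVE_RE = re.compile(r"(\S+) is (%s) by (\S+)" % "|".join(PASSIVE))
ACTIVE_RE = re.compile(r"(\S+) (%s) (\S+)" % "|".join(ACTIVE))


def clean(name: str) -> str:
    """Strip surrounding punctuation and lowercase an entity name."""
    return name.strip(".,;()").lower()


def extract_relation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (predicate, agent, target), or None if no pattern matches."""
    m = PASSIVE_RE.search(sentence)
    if m:
        target, verb, agent = m.groups()
        return (PASSIVE[verb], clean(agent), clean(target))
    m = ACTIVE_RE.search(sentence)
    if m:
        agent, verb, target = m.groups()
        return (ACTIVE[verb], clean(agent), clean(target))
    return None


passive = extract_relation("c-jun is activated by VRK1")
active = extract_relation("VRK1 activates c-jun.")
print(passive, active)
```

Both calls yield the same triple, which is the sense in which the semantic representation abstracts away the syntax. A sentence-level pattern like this cannot follow referring expressions across sentences, which is the pragmatics problem noted on the slide.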

21 Structure of Biomedical Language
[Figure: main NLP levels, from word tokenization to semantics]

22 Biological Terminology
- Biological literature is characterized by heavy use of domain-specific terminology.
  - ~12% of all terms in biochemistry publications are technical terms.
  - Medical terms and their variations need to be recognized automatically.
- 2 main challenges:
  - constant formation of new terms and new short forms
  - ambiguity or polysemy (multiple meanings of the same word)

23 Biological Terminology
- Ambiguity or polysemy:
  - Text mining tools must select the correct sense of a word, using the surrounding context for disambiguation.
  - Gene names are a problem because they are often shared across species.
    - general English => 0.57% ambiguity
    - medical terms => 1.01% ambiguity
    - gene names => 14.20% ambiguity
- Biomedical and life science literature depends heavily on short forms => further ambiguity.
  - Online tools for acronym/full-name pairs: ADAM, the Abbreviation Server, and Acromine.
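A minimal sketch of how acronym/full-name pairs might be detected: look for a parenthesized uppercase short form and check whether the initials of the preceding words spell it out. This is a simplified heuristic written for illustration, not the method used by ADAM, the Abbreviation Server, or Acromine, and the function name is hypothetical.

```python
import re


def find_abbreviations(text: str) -> list:
    """Return (short form, long form) pairs found by an initials heuristic."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        # Take as many preceding words as the short form has letters.
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        initials_match = len(candidate) == len(short) and all(
            word[0].upper() == letter for word, letter in zip(candidate, short)
        )
        if initials_match:
            pairs.append((short, " ".join(candidate)))
    return pairs


text = (
    "Natural language processing (NLP) and hidden Markov models (HMM) "
    "are common in text mining."
)
pairs = find_abbreviations(text)
print(pairs)
```

The heuristic misses short forms with lowercase letters or skipped words (for example "MeSH"), and it does nothing about the ambiguity problem itself: the same short form can expand to different long forms in different articles.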

24 Lexical and Semantic Sources for Biology
- Domain-specific technical terms are used to express functional descriptions of bio-entities, relevant biological processes, and experimental techniques.
- Terminological repositories and dictionaries are important resources for interpreting scientific articles; many have been developed.
- Ontologies have been developed for various subfields of biology.
  - Gene Ontology (GO): widely used as a controlled vocabulary to describe biologically relevant aspects of gene products.

25 Lexical and Semantic Sources for Biology
- Ontologies: Gene Ontology (GO)
  - Although primarily designed for annotation purposes, GO can also be used as a lexical resource for indexing, as in the GoPubMed application.
  - GOAnnotator allows extraction of text-based GO annotations for a given protein identifier (Swiss-Prot accession number).
  - The GO annotation task in BioCreative I showed that automatic detection of GO terms works better for short terms.

26 Lexical and Semantic Sources for Biology
- Word level:
  - SwissProt: biological annotation database.
  - BioThesaurus: widely used resource combining gene and protein names from multiple sources.
  - TerMine: developed at the National Centre for Text Mining (NaCTeM); integrates an automatic term recognition approach using linguistic and statistical analysis of candidate terms.

27 Biomedical Literature Processing Applications
- Provide access to information in scientific articles at various levels of granularity.
- Building blocks for biomedical text processing can be grouped with respect to the BioCreative tasks:
  - Document retrieval: core of the interaction article subtask, which selects articles about protein-protein interactions.
  - Entity mention: identification of mentions of biological entities.
  - Entity normalization: linking biological entities (e.g., genes, proteins) to biological resources (e.g., SwissProt, Entrez Gene).
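Entity normalization can be sketched as a dictionary lookup from name variants to a database identifier. In the sketch below the identifiers `GENE:0001` and `GENE:0002`, the synonym lists, and the function names are all invented placeholders, not real SwissProt or Entrez Gene records.

```python
import re

# Hypothetical synonym table: identifier -> known name variants.
SYNONYMS = {
    "GENE:0001": ["c-jun", "JUN"],
    "GENE:0002": ["VRK1", "vaccinia related kinase 1"],
}


def normalize_name(name: str) -> str:
    """Lowercase and drop spaces, hyphens, and underscores."""
    return re.sub(r"[\s\-_]+", "", name.lower())


def build_index(synonyms: dict) -> dict:
    """Map each normalized name variant to the set of identifiers using it."""
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            index.setdefault(normalize_name(name), set()).add(identifier)
    return index


def lookup(index: dict, mention: str) -> set:
    """Return candidate identifiers for a mention (empty set if unknown)."""
    return index.get(normalize_name(mention), set())


index = build_index(SYNONYMS)
print(lookup(index, "C-Jun"), lookup(index, "vrk 1"), lookup(index, "p53"))
```

The lookup returns a set because one name can point to several identifiers, for instance the same gene symbol in different species. Choosing among the candidates then needs context such as the organism mentioned in the article.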

28 BioMed. Lit. Proc. Apps: Document Retrieval
- Requires the ability to process and index massive volumes of data (e.g., the entire MEDLINE collection).
  - Must be robust and efficient with respect to space and time.
- Look for keywords that characterize a collection of papers, based on keyword frequency.
  - This is the basis of neighbor searches in MEDLINE (the predecessor of eTBLAST), still the most heavily used system.
- Statistical analysis of word occurrences:
  - Many current literature mining systems rely on it.
  - Statistics are calculated over the whole PubMed database, resulting in weighted associations between biological entities.

29 BioMed. Lit. Proc. Apps: Document Retrieval
- Statistical analysis of word occurrences:
  - The underlying assumption is that if two biological entities frequently co-occur, they should have some biological relationship.
  - Can provide high recall.
  - Poses a challenge for human interpretation: it lacks semantic information on the type of biological association.
- CoPub Mapper system: provides online access to ranked co-occurrence associations (between genes and biological terms) extracted from PubMed.
- PubGene system: generates a graphical protein interaction network based on protein-protein literature co-occurrences.
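The co-occurrence assumption can be sketched as a simple pair count over abstracts. The three "abstracts" below are invented toy sentences, the function name is hypothetical, and the crude substring matching stands in for a real entity tagger; this is not how CoPub Mapper or PubGene are implemented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts: list, entities: list) -> Counter:
    """Count how many abstracts mention each pair of entities together."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Crude matching: an entity is "present" if its name is a substring.
        present = sorted(e for e in entities if e.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Toy data for illustration only.
abstracts = [
    "VRK1 phosphorylates c-jun in vitro.",
    "c-jun and VRK1 were co-expressed.",
    "ATF2 is a substrate of VRK1.",
]
counts = cooccurrence_counts(abstracts, ["VRK1", "c-jun", "ATF2"])
for pair, n in counts.most_common():
    print(pair, n)
```

The counts rank entity pairs but say nothing about the kind of relationship (activation, binding, mere co-mention), which is the interpretation problem the slide points out.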

30 BioMed. Lit. Proc. Apps: Document Retrieval
- Stemming: converts words into standardized forms (stems).
  - An essential component of IR systems and search engines.
  - One common shortcoming: two semantically different words can be collapsed to a common stem.
  - Used by systems such as eTBLAST to quantify the similarity between documents.
- CoPub system: detects over-represented terms from multiple abstract collections.
- eTBLAST: ranks retrieved PubMed records given an input article.
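The sketch below shows how stemming feeds a document similarity score, using a toy suffix stripper and cosine similarity over stem counts. The stemmer is a stand-in written for this example, not the Porter stemmer, and the similarity measure is an assumption; the slides do not say which measure eTBLAST uses.

```python
import math
from collections import Counter

# Toy suffix list, checked in order; NOT the Porter algorithm.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 letters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between bag-of-stems vectors of two texts."""
    va = Counter(toy_stem(w) for w in doc_a.split())
    vb = Counter(toy_stem(w) for w in doc_b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


similar = cosine_similarity("binding proteins", "binds protein")
unrelated = cosine_similarity("binding proteins", "gene expression")
print(similar, unrelated)
print(toy_stem("news"), toy_stem("new"))
```

Without stemming, "binding proteins" and "binds protein" share no words at all; with it they become nearly identical vectors. The last line shows the shortcoming from the slide: "news" and "new" mean different things but collapse to the same stem.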

31 BioMed. Lit. Proc. Apps: Document Retrieval
- Clustering algorithms:
  - Commonly used to group genes according to their expression profiles in microarray experiments.
  - Applied to literature using document similarity calculations, as in PubClust and McSyBi.
[Figure: list of systems for clustering and similarity ranking]

32 Biomedical Literature Processing Applications: Gene Mention & Gene Normalization
- Biologists search the annotation databases using gene/protein names or symbols as queries.
  - These names have been manually extracted from the literature, which is:
    - too time-consuming
    - unable to cover all synonyms or naming variants used by biologists
- Automatic detection of protein and gene mentions:
  - improves the coverage of annotation databases
  - enables semantically refined literature search
  - constitutes a crucial initial step for other text mining systems
  - was the focus of the BioCreative gene mention task
    - performance of 90% F-measure
    - training data of 15K sentences and 5K test sentences
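For readers unfamiliar with the F-measure quoted on this slide, the sketch below computes precision, recall, and the balanced F-measure (F1) for a set of predicted gene mentions against a gold standard. The mention sets are invented, and exact matching of (sentence id, mention) pairs is an assumption; the actual BioCreative scoring rules are more detailed.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple:
    """Return (precision, recall, F1) for exact-match mention sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented (sentence id, mention) pairs for illustration.
gold = {(0, "VRK1"), (0, "c-jun"), (1, "ATF2"), (2, "p53"), (3, "BRCA1")}
predicted = {(0, "VRK1"), (0, "c-jun"), (1, "ATF2"), (1, "kinase")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here 3 of the 4 predictions are correct (precision 0.75) and 3 of the 5 gold mentions are found (recall 0.60), giving an F1 of about 0.67. F1 is the harmonic mean of precision and recall, so it is high only when both are high.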

33 Biomedical Literature Processing Applications: Gene Mention & Gene Normalization
- Most current bio-entity recognition systems (e.g., GAPSCORE, ABGENE) can label text for protein or gene mentions.
  - Other systems, such as ABNER, also identify cell lines or cell types.
- Chemical compound mentions: another set of biological entities of interest.
  - Oscar: open source system for chemical entity recognition.
  - Integrates a dictionary of compound names and uses regular expressions, heuristics, and certain word combinations to find chemical names in text.

34 Biomedical Literature Processing Applications: Gene Mention & Gene Normalization
- Mentions of species and taxonomic names:
  - important for the emerging field of biodiversity
  - a crucial step in linking gene mentions to the corresponding source organism
- Detecting bio-entity mentions alone is often not enough to retrieve informative sentences.
  - BioIE system: for a given query keyword, detects only sentences related to protein families, functions, etc.
  - Other applications such as iHOP: given a gene or protein, map it to its corresponding database identifier and retrieve related sentences with definition information, etc.

35 Biomedical Literature Processing Applications: Gene Mention & Gene Normalization
- Detecting bio-entity mentions alone is often not enough to retrieve informative sentences.
  - EBIMed and FACTA systems: for a given query protein, present a summary table of co-occurring concepts based on PubMed abstracts.
  - FABLE: retrieves co-occurring gene and protein mentions for a query keyword; results can be downloaded in XML or Excel format.
- Searching functional information for gene products:
  - Search with protein sequences is possible through the METIS and MedBlast systems.
  - The query sequence is linked to the corresponding database record, and the associated literature is retrieved afterwards.

36 Beyond BioCreative: Advanced Applications
- iHOP and InfoPubMed: allow retrieval of protein interaction sentences from PubMed.
- Chilibot: finds supporting relationship evidence between two predefined entities of interest (genes, proteins, keywords).
- MutationFinder: extracts amino-acid mutation mentions from large text collections.
- MarkerInfoFinder: detects information related to sequence variants of human genes.

37 Beyond BioCreative: Advanced Applications
- PepBank database (of peptide sequences): a text-mining system was used to automatically detect and extract peptide sequences from abstracts and full-text papers.
- Phospho.ELM database: integrated a text-mining system to detect S/T/Y phosphorylation sites from the literature.
- MeInfoText & PubMeth: use text mining to provide detailed information on gene methylation and its association with cancer.
- EpiLoc system: a text-based subcellular location prediction tool (complementing alternative sequence-based localization algorithms).

38 Summary: Biological Text Mining Applications from the Biology User Perspective
- Protein relations
- Function annotation & localization relations
- Gene group & lists analysis

39 Summary: Biological Text Mining Applications from the Biology User Perspective
- Acronym and term extraction
- Gene-disease association

40 Summary: Biological Text Mining Applications from the Biology User Perspective
- Gene-disease association
- Bio-entity tagging
- Text retrieval, classification, clustering, similarity ranking

41 Summary: Biological Text Mining Applications from the Biology User Perspective
- Protein sequence
- Gene group & lists analysis

42 References
Main references:
- M. Krallinger, A. Valencia, L. Hirschman. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008; 9:S8.
- Z. Lu. PubMed and beyond: a survey of web tools for searching biomedical literature. Database.
For original images and references to the mentioned tools, please either search online for their names or refer to the original articles above.

43 Questions? Please let us know in case of any questions/issues! Further info: {scetinta, 43
