The LAILAPS Search Engine - A Feature Model for Relevance Ranking in Life Science Databases

Similar documents
Software review. Biomolecular Interaction Network Database

Bioinformatics Hubs on the Web

mpmorfsdb: A database of Molecular Recognition Features (MoRFs) in membrane proteins. Introduction

Genome Browsers - The UCSC Genome Browser

Literature Databases

INTRODUCTION TO BIOINFORMATICS

Exploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix

BovineMine Documentation

HymenopteraMine Documentation

RLIMS-P Website Help Document

INTRODUCTION TO BIOINFORMATICS

Multiple Sequence Alignment

Core Technology Development Team Meeting

CACAO Training. Jim Hu and Suzi Aleksander Spring 2016

Tutorial: Using the SFLD and Cytoscape to Make Hypotheses About Enzyme Function for an Isoprenoid Synthase Superfamily Sequence

Topics of the talk. Biodatabases. Data types. Some sequence terminology...

SciMiner User s Manual

EBI services. Jennifer McDowall EMBL-EBI

8:15 Introduction/Overview Michelle Giglio. 8:45 CloVR background W. Florian Fricke. 9:15 Hands-on: Start CloVR W. Florian Fricke

Facilitating Semantic Alignment of EBI Resources

EVIDENCE FOR SHOWING GENE/PROTEIN NAME SUGGESTIONS IN BIOSCIENCE LITERATURE SEARCH INTERFACES

Geneious 5.6 Quickstart Manual. Biomatters Ltd

org.hs.ipi.db November 7, 2017 annotation data package

Biostatistics and Bioinformatics Molecular Sequence Databases

How Can a Warehouse Manage Your Wheat-(Data)?

Annotating a Genome in PATRIC

An Algebra for Protein Structure Data

EBI is an Outstation of the European Molecular Biology Laboratory.

The genexplain platform. Workshop SW2: Pathway Analysis in Transcriptomics, Proteomics and Metabolomics

IPA: networks generation algorithm

Core Technology Development Team Meeting

Retrieval of Highly Related Documents Containing Gene-Disease Association

Eagle Eye. Sommersemester 2017 Big Data Science Praktikum. Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly

Genome Browsers Guide

This manual describes step-by-step instructions to perform basic operations for data analysis.

Finding and Exporting Data. BioMart

User Manual. Ver. 3.0 March 19, 2012

EBI patent related services

Editing Pathway/Genome Databases

Presenter: Payam Karisani

Our Task At Hand Aggregate data from every group

Lab 4: Multiple Sequence Alignment (MSA)

Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords [version 1; referees: awaiting peer review]

This document contains information about the annotation workflow for the Full BioCreative interactive task.

Steering Committee Meeting

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Blast2GO Teaching Exercises SOLUTIONS

Overview of BioCreative VI Precision Medicine Track

2. Take a few minutes to look around the site. The goal is to familiarize yourself with a few key components of the NCBI.

mgu74a.db November 2, 2013 Map Manufacturer identifiers to Accession Numbers

2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.

Bio wikis. Paolo Romano Bioinformatics, National Cancer Research Institute, Genova

Applied Bioinformatics

Structural Bioinformatics

SQL Studio (BC) HELP.BCDBADASQL_72. Release 4.6C

Lecture 5 Advanced BLAST

Exercises. Biological Data Analysis Using InterMine workshop exercises with answers

mogene20sttranscriptcluster.db

The Gaggle. Connect R to Firefox and assorted Java programs. Paul Shannon. Institute for Systems Biology & Seattle Biomedical Research Institute

MDA Blast2GO Exercises

TBtools, a Toolkit for Biologists integrating various HTS-data

Applied Bioinformatics

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

HOW TO SUBMIT AN ASSIGNMENT

Two Examples of Datanomic. David Du Digital Technology Center Intelligent Storage Consortium University of Minnesota

SciVerse ScienceDirect. User Guide. October SciVerse ScienceDirect. Open to accelerate science

About the Edinburgh Pathway Editor:

EURISCO catalogue. Status quo & planned developments. EURISCO training workshop, 19 th to 21 st May 2015, Tirana, Albania. Stephan Weise 19 May 2015

mirnet Tutorial Starting with expression data

User guide for GEM-TREND

HsAgilentDesign db

Annotating a single sequence

Update: MIRIAM Registry and SBO

Laboratorio di Basi di Dati per Bioinformatica

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Analyzer of Bio-resource Citations. World Data Center of Microorganisms(WDCM)

Oracle Application Express 5.1

Enhancing Internet Search Engines to Achieve Concept-based Retrieval

hgu133plus2.db December 11, 2017

Load KEGG Pathways. Browse and Load KEGG Pathways

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

Supplementary Note 1: Considerations About Data Integration

Core Technology Development Team Meeting

LabWare 7. Why LabWare 7?

RDF for Life Sciences

Tutorial: chloroplast genomes

NCBI News, November 2009

Welcome - webinar instructions

Abstract. of biological data of high variety, heterogeneity, and semi-structured nature, and the increasing

Differential Expression Analysis at PATRIC

ESG: Extended Similarity Group Job Submission

Pathway Analysis using Partek Genomics Suite 6.6 and Partek Pathway

Tutorial 1: Exploring the UCSC Genome Browser

Chapter 2. Architecture of a Search Engine

MetScape User Manual

BMMB 597D - Practical Data Analysis for Life Scientists. Week 12 -Lecture 23. István Albert Huck Institutes for the Life Sciences

Genomic pathways database and biological data management

Searching the World-Wide-Web using nucleotide and peptide sequences

SpiraTeam Help Desk Integration Guide Inflectra Corporation

Sequence Alignment. GBIO0002 Archana Bhardwaj University of Liege

Transcription:

International Symposium on Integrative Bioinformatics 2010 The LAILAPS Search Engine - A Feature Model for Relevance Ranking in Life Science Databases M Lange, K Spies, C Colmsee, S Flemming, M Klapperstück, U Scholz Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany lange@ipk-gatersleben.de IPK Gatersleben, March 22, 2010

Outline Information Retrieval in Live Science The LAILAPS Approach Query Engine & Feature Scoring Relevance Prediction The LAILAPS System 03/22/2010 2

Life Science Data Universe 03/22/2010 3

Life Science Data Universe Data Domain Entry# Data Domain Entry# Literature (PubMed): 19*10 6 Molecules (PubChem) 62*10 6 DNA-sequences (GenBank): 112*10 6 Data Domain Entry# Protein sequences(uniprot): 11*10 6 Protein functions (OLS, PDB, INTACT, PFAM): 1*10 6 Pathway (KEGG): 0.1*10 6 Genes (KEGG): 5*10 6 03/22/2010 4

Life Science Data Universe chlorophyll synthase words: 1111 lines: 164 words: 1516 lines: 238 476 entries of chlorophyll synthase on average 1200 words per entry on average 200 lines 571,200 words 95,200 lines 1322 pages A4 03/22/2010 5

Life Science Data Universe - Search 03/22/2010 6

The Story of Searching Data 1. 37% of life science scientists spending over 80% of working time in front of a computer 2. 47% of all scientist make daily use of search engines Divoli A, Hearst MA, Wooldridge MA., Evidence for showing gene/protein name suggestions in bioscience literature search interfaces., Pac Symp Biocomput. 2008:568-79 03/22/2010 7

The LAILAPS Approach 03/22/2010 8

Motivation for LAILAPS platform for information retrieval over isolated databases content sensitive relevance ranking user profiles for relevance estimation estimation of relevance factors by tracking user behavior 03/22/2010 9

LAILPAS Search Engine relevance score feature vector hit excerpt & link to data browser phrase and conjunctive multi term queries user login Javascript data feature scores browsing behaviour tracker synonym & database filter chart type relevance rating 03/22/2010 10

LAILAPS Feature Model 1. attribute in the record 2. database in which the hit was found 3. frequency of hits per attribute and record 4. co-occurrence of query terms 5. keyword surrounding the hit position 6. organism that is reported in the record 7. sequence length 8. text position of the hit in the record structure 9. synonym that produced the hit 03/22/2010 11

LAILAPS Ranking Workflow query Q = t 1 t n find: using a text index system relevance ( 1 9) : using a feed forward neural network rank: sorting of relevance scores relevance ranking 1 n query result R = (t,d,{(a,p)}) score: using features F 1 F 9 relevance score chlorophyll synthase ('chlorophyll', 'Q9MBA1', {('DE', 83), ('RT',2)} ('synthase', 'Q9MBA1', {('DE', 97)} entries feature scores ( 1 9) score freq = 3 score co-ocurence = 1-(distance/maxdist) * {cor: 1; wrong: 0.3}... N feedforward :(neurons = 20, input =9, output =1, edges = 7 x 4) Q9MBA1 = N(68, 27,71,33,80,0,31,0,0) = 0.935 03/22/2010 12

Query Engine: Text Index decompose text into textual atoms avoid line breaks in tokens do not decompose number groups keep chars in token _-/*+. ignore chars in token ()[]<>'',:' token delimiter whitspace and =;\t stop words (ignore frequent but useless tokens) standard in natural language but, his, nor, than, was, by, how, not, that, we, can, however, of, the, were more in life sciences product, locus_tag, db_xref, region, source, DB, organism, CDS, xref, interpro, submission, submitted, pfam, binding, table, embl, geneid, genomic, transl, cdd, sequence, pubmed, taxon, IEA, NIH, genbank, data, here synonyms (from databases Brenda, UniProt) Synonyms Relations: 67521 Protein names: 76869 M. Lange, K. Spies @ IPK-Gatersleben 09/12/2008 13

Relevance Prediction 03/22/2010 14

Relevance Prediction: Confidence Classes confidence class 1 2 3 accession curated rank lailaps rank relevant problem Acc1 1 5 no wrong ranking Acc2 2 2 yes Acc3 3 3 yes Acc4-4 - additional hit (e.g. synonm) Acc5 4 5 yes Acc6 5 - no database version Acc7 6 6 yes Acc8 7 7 yes Acc9 8 - no text decomposition Acc10 9 8 yes Acc11 10 1 no wrong ranking 03/22/2010 15

Relevance Prediction: Training Data 03/22/2010 16

Relevance Prediction: Training Data 22 protein queries using data retrieval systems 1069 manual ranked database records 3 quality classes (high, medium, low) 03/22/2010 17

Relevance Prediction: Neural Network Training split of training/validation data: 80/20 training epochs: 500 mean square error: 0.33 type: feed forward architecture: 7-4 03/22/2010 18

User Ranking Profile: Relevance Rating Explicit rating Implicit rating: tracking browsing behavior using JavaScript clicked result entries clicked entries above, below activity time scroll amount mouse movement lost / got focus text selection 03/22/2010 19

Relevance Prediction: Workflow Manual Rating Predicted Rating Query Results Training Data Feedback System Relevance Prediction 03/22/2010 20

The LAILAPS System 03/22/2010 21

LAILAPS Technical Background 3rd-party Retrieval System (e.g. SRS@EBI, in house) Color Code - Technology: Tapestry / ispring ORACLE 11g Color Code Software Modules: Frontend HTTP/SQLNet JAVA 1.5, JOONE, SPRING Network Communication HTML Result Browser Text Query Render Engine Search Form RMI Result Ranking SQL Administration Configuration Feedback Tracking DBMS Backend/Logic ORACLE 11g 03/22/2010 22

LAILAPS Customization database to be indexed synonyms configuration 03/22/2010 23

LAILAPS Benchmark 03/22/2010 24

Benchmark of Neuronal Network Ranking Precision in % 03/22/2010 25

Conclusion 1. LAILAPS as search engine for life science dabases 2. objective subjectiveness learning of human relevance classes 3. cost-reduction of scientific information retrieval # database records estimated effort in person day - 50 0,5 51 100 1 101-250 > 1 4. user > 250 relevance profiles objective rating hardly not possible 5. deterministic quality of database research 03/22/2010 26

LAILAPS Project: http://lailaps.ipk-gatersleben.de Contact: lange@ipk-gatersleben.de Acknowledgement Jinbo Chen (IPK Gatersleben) Mandy Weißbach (IPK Gatersleben) Gregor Haberhauer (BASF) Michael Leps (BASF Plant Science) Jens Stein (BASF Plant Science) Röbbe Wünschiers (FH Mitweidae) 03/22/2010 27