Bioinformatics Resources
Lecture & Exercises
Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12
Bioinformatics Resources
- Organization
- Schedule
- Overview
Organization
- Lecture: Friday 9-12, i.e. 9.15-11.45, with a 15 min break in between, room 00.13.009A
- Exercise: Friday 13-15, room 01.09.014, starting Fri, Apr. 24th
- Exercise: Monday 14-16, room tba, starting Mon, Apr. 27th
Team Behind the Course
Putative Schedule
- Apr. 17th: Intro, General Overview
- Apr. 24th: Sequence Databases
- May 8th: Sequence Databases
- May 15th: Structure Databases
- May 22nd: File Formats
- May 29th: SQL
- Jun 5th: SQL
- Jun 12th: No-SQL
- Jun 19th: JavaScript / UI
- Jun 26th: Web Services
- Jul 3rd: Bioinformatics Suites
- Jul 10th: Wrap Up, Q&A
- Jul 17th: Exam (prelim)
Overview
- the lecture is completely new, built from scratch
- first iteration
- no prior syllabus available
- depending on the progress of the lecture, single topics could be added or dropped
- the sequence of topics might be shuffled
- hybrid nature: the presentation of existing resources is blended with back-end and front-end technology
Exercises
- exercises help to convert knowledge into a skill
- practical application of topics covered in the lecture
- active exploration of bioinformatics resources
- implementing various parts of a bioinformatics resource
Meaning
- What does "resource" actually mean?
- a Google query for "Bioinformatics Resource" yields about 20 million hits
- the hits fall roughly into three categories:
  - databases
  - tools
  - service centers
Working on a Definition
- a collection of information which is useful for research in the life sciences / computational biology
- contains the information itself
- provides appropriate interfaces to access the information
- may provide tools for interactive data analysis
GenBank
- NIH genetic sequence database
- annotated collection of all publicly available DNA sequences
- part of the International Nucleotide Sequence Database Collaboration, together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL)
GenBank II
- new release every 2 months
- retrievable via FTP from the NCBI website
- current release is 206.0, Feb 15 2015
- 187,893,826,750 bases from 181,336,445 reported sequences
- GenBank flat file format
GenBank III
- three main divisions: CoreNucleotide, dbEST, dbGSS
- querying via Entrez Nucleotide
- interactive BLAST analysis with user sequences
- programmatic access via NCBI E-utilities
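As a hedged illustration of programmatic access via the NCBI E-utilities, the sketch below only assembles an `efetch` URL for a nucleotide record in GenBank flat file format; it does not send a request. The helper name `build_efetch_url` and the accession `EXAMPLE_ACCESSION` are hypothetical placeholders, and the parameter defaults (`nuccore`, `gb`, `text`) are one plausible choice, not the only one.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Return an efetch URL for one record (hypothetical helper, no request is sent)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # "EXAMPLE_ACCESSION" is a placeholder, not a real GenBank accession.
    print(build_efetch_url("EXAMPLE_ACCESSION"))
```

The resulting URL could be passed to any HTTP client; building it separately keeps the query parameters easy to inspect and test.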
Swiss-Prot/UniProt
- official name: UniProtKB/Swiss-Prot
- history
- current release: 2015_04
- 548,208 sequence entries, 195,282,524 amino acids, abstracted from 235,893 references
- manually annotated
Swiss-Prot/UniProt II
- manual annotation process
- standard operating procedure
- controlled vocabularies
- guidelines
- offered services: BLAST, Align, ID mapping
- associated services
Other UniProt Services
- TrEMBL
- Proteomes
- UniRef
- UniParc
- programmatic access
PDB
- history
- currently 108,124 structures, incl. 100,450 proteins
- PDB formats
- data upload/validation
- data dictionaries
PDB II
- retrieval
- programmatic access
- visualization with the different views
- file format transitions: PDB and mmCIF
SCOP/e
- Structural Classification of Proteins
- history, current version is SCOPe 2.05
- changes in SCOPe
- access
- needed/recommended additional software
PFAM
- current version is 27.0, March 2014
- what it is about
- categories
- interactive use
- programmatic access
Prosite
- current version 20.113, Mar 26th
- UniRule format and ProRule
- access
- typical use and interfaces
PubMed
- what it is for
- search opportunities
- linking to other information sources
- search strategies
File Formats
- high-throughput data:
  - BAM, SAM
  - VCF
- Newick tree file format
- GenBank/EMBL
- PDB: mmCIF
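To make one of the listed formats concrete, here is a minimal sketch of reading the eight fixed, tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function names are hypothetical, the sample record in the usage part is invented for illustration, and sample/genotype columns are deliberately ignored.

```python
# The eight mandatory, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one VCF data line into a dict (hypothetical helper, fixed columns only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 tab-separated fields")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])          # 1-based position
    record["ALT"] = record["ALT"].split(",")    # several alternate alleles are comma-separated
    info = {}
    for item in record["INFO"].split(";"):
        if item in ("", "."):
            continue                            # "." marks a missing value
        key, sep, value = item.partition("=")
        info[key] = value if sep else True      # keys without "=" are flags
    record["INFO"] = info
    return record


def read_vcf(lines):
    """Parse all data lines, skipping blank lines and '#' header lines."""
    return [parse_vcf_line(l) for l in lines if l.strip() and not l.startswith("#")]


if __name__ == "__main__":
    # Invented example content, not taken from a real file.
    demo = ["##fileformat=VCFv4.2", "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=14;DB"]
    print(read_vcf(demo))
```

For real data a dedicated library is usually the better choice; the sketch only shows how little structure the text format imposes.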
File Formats
- equivalence and transformations between different formats
- XML formats
- RDF formats
SQL
- SQL basics
- data types
- table creation and manipulation
- join
- select
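The basics named on this slide (data types, table creation, join, select) can be sketched in a few lines. The example uses Python's built-in `sqlite3` module with an in-memory database purely for convenience; the lecture itself names MySQL and PostgreSQL, so this is a swapped-in engine. Table names, column names and rows are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (invented schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organism (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE sequence (
            accession   TEXT PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES organism(id),
            length      INTEGER NOT NULL
        );
        INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
        INSERT INTO sequence VALUES
            ('SEQ_A', 1, 1200), ('SEQ_B', 1, 300), ('SEQ_C', 2, 950);
        """
    )
    return conn


def sequences_longer_than(conn, min_length):
    """SELECT with a JOIN: accession, organism name and length above a threshold."""
    query = """
        SELECT s.accession, o.name, s.length
        FROM sequence AS s
        JOIN organism AS o ON o.id = s.organism_id
        WHERE s.length > ?
        ORDER BY s.accession
    """
    return conn.execute(query, (min_length,)).fetchall()


if __name__ == "__main__":
    print(sequences_longer_than(build_demo_db(), 500))
```

The `?` placeholder passes the threshold as a bound parameter rather than pasting it into the SQL string.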
SQL II
- keys
- indexes
- performance influence of indexes
- similarity search vs substrings
- permissions
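One way to observe the influence of an index is to compare the query plan before and after creating it. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` through the built-in `sqlite3` module; SQLite is an assumption made for a self-contained example, the exact plan wording differs between SQLite versions, and other engines (MySQL, PostgreSQL) have their own `EXPLAIN` output. Table and index names are invented.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def demo():
    """Return the plan of the same lookup before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT, length INTEGER)")
    conn.executemany(
        "INSERT INTO protein (name, length) VALUES (?, ?)",
        [(f"prot{i}", i % 500) for i in range(1000)],  # invented rows
    )
    sql = "SELECT id FROM protein WHERE name = ?"
    before = query_plan(conn, sql, ("prot42",))   # expected: a full table scan
    conn.execute("CREATE INDEX idx_protein_name ON protein (name)")
    after = query_plan(conn, sql, ("prot42",))    # expected: a search using the index
    return before, after


if __name__ == "__main__":
    for label, plan in zip(("before", "after"), demo()):
        print(label, "->", plan)
```

An exact-match lookup can use the index; a substring search such as `LIKE '%abc%'` generally cannot, which is one reason substring and similarity searches need different techniques.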
SQL III
- transactions
- setup, administration, backup
- programmatic access
- MySQL, PostgreSQL
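A transaction groups several statements so that either all of them take effect or none do. The following sketch shows this with Python's built-in `sqlite3` module, again as a stand-in for MySQL or PostgreSQL; the table, the helper names and the simulated failure are all invented for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one empty table (invented schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (accession TEXT, note TEXT)")
    conn.commit()
    return conn


def insert_annotations(conn, rows, fail=False):
    """Insert all rows in one transaction; return True if it was committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.executemany(
                "INSERT INTO annotation (accession, note) VALUES (?, ?)", rows
            )
            if fail:
                raise RuntimeError("simulated failure in the middle of the transaction")
    except RuntimeError:
        return False
    return True


def count_rows(conn):
    """Number of rows currently stored in the annotation table."""
    return conn.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    rows = [("SEQ_A", "kinase"), ("SEQ_B", "transporter")]
    print(insert_annotations(conn, rows, fail=True), count_rows(conn))
    print(insert_annotations(conn, rows), count_rows(conn))
```

After the failed call the table is still empty, because the rows inserted before the error were rolled back together.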
NoSQL
- definitions of NoSQL
- advantages / disadvantages
- typical use cases
- types of NoSQL databases
- query (languages)
NoSQL Systems
- MongoDB
- CouchDB
- Neo
- programmatic access
(Storing Facts)
- triple stores
- data model
- RDF refresher
- query language: SPARQL
- examples
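The triple data model can be illustrated without any database: every fact is a (subject, predicate, object) tuple, and a query is a pattern in which some positions are left open. The sketch below is a toy in plain Python, not a real triple store and not SPARQL; the function `match` and all identifiers in the sample data are invented.

```python
# Each fact is a (subject, predicate, object) triple; all identifiers are invented.
TRIPLES = [
    ("protein:P1", "rdf:type", "ex:Protein"),
    ("protein:P1", "ex:organism", "taxon:T1"),
    ("protein:P2", "rdf:type", "ex:Protein"),
    ("protein:P2", "ex:organism", "taxon:T1"),
    ("protein:P3", "ex:organism", "taxon:T2"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


if __name__ == "__main__":
    # Roughly comparable to the SPARQL pattern: ?s ex:organism taxon:T1
    for s, _, _ in match(TRIPLES, predicate="ex:organism", obj="taxon:T1"):
        print(s)
```

A real triple store adds indexing, namespaces and joins across several patterns, which is what a query language like SPARQL provides.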
Programming Libraries
- roadshow of programming libraries dedicated to bioinformatics:
  - BioPerl
  - Biopython
  - BioJS
- visualization
Graphical User Interfaces
- principles
- interaction modes
- modelling interaction modes
Graphical User Interfaces
- interactive user interfaces with JavaScript
- language basics
- programming model
- client/server communication with JSON
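Client/server communication with JSON boils down to serializing a request on one side and parsing it on the other. The sketch below plays both roles in one Python process, without any network or browser; in the lecture's setting the client would be JavaScript. The message layout (`method`, `params`, `ok`, `result`) and the function names are assumptions made for illustration, not a standard protocol.

```python
import json


def encode_request(method, params):
    """Client side: serialize a request into a JSON string (invented message layout)."""
    return json.dumps({"method": method, "params": params}, sort_keys=True)


def handle_request(raw):
    """Server side: parse the JSON request and answer with a JSON response."""
    request = json.loads(raw)
    if request["method"] == "sequence_length":
        result = len(request["params"]["sequence"])
        return json.dumps({"ok": True, "result": result})
    return json.dumps({"ok": False, "error": "unknown method"})


if __name__ == "__main__":
    raw_request = encode_request("sequence_length", {"sequence": "MKT"})
    print(raw_request)
    print(handle_request(raw_request))
```

Because both sides only exchange text, the same messages could be produced in the browser with `JSON.stringify` and read back with `JSON.parse`.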
JavaScript
- libraries for data visualization/bioinformatics
- BioJS
- D3
Client/Server Models
- CGI
- web services
- Remote Procedure Calls / CORBA
- security considerations
Authentication/Encryption
- authentication models
- communication encryption
- data/result encryption
- legal privacy issues
- data access models
Web Services I
- types of web services
- web service components
- integration of web services in software
Web Services II
- client-side interfaces to web services
- server-side interfaces to web services
- Apache configuration for web services:
  - required modules
  - configuration
  - performance
Bioinformatics Suites
- where to find
- installation/configuration
- workflow systems: e.g. Taverna,...
- EMBOSS, STADEN
- bio-......
Selected Bioinformatics Suites
- Aquaria
- ARB
- PEDANT
- PredictProtein
Summary I
aim of this module:
- shape the concept of a bioinformatics resource
- become familiar with some of the most prominent examples out there
- get in touch with the underlying technology
- gather ideas and experience on how to realize a new bioinformatics resource
Summary II
- hands-on (interaction) experience with existing resources
- backend technology, i.e. various database models
- frontend technology to realize the UI / design rationales
- communication models
Grading
- graded by a written exam, 90/100 min
- scheduled day Jul 17th, depends on:
  - available room
  - number of participants
- exam admission: 50% of exercise/homework points
- the number of points is given for every exercise sheet
Exercises
- exploration of available resources
- simple to intermediate programming tasks
- presentation of the task in week x
- submission in week x+1
- feedback in week x+2
Exercises II
- ~10 exercise sheets
- awarded points between 0 and 10
- work in groups of 2
- one submission per group
- late submission fails for all group members
- give the names of all group members
Exercises III
- groups fixed for the course
- new sheets are published on Friday
- submission is due on Friday morning for all groups
Questions & Answers
- group formation
- two slots for exercises