Bioinforma)cs Resources - Swissprot -

Size: px

Start display at page:

Download "Bioinforma)cs Resources - Swissprot -"

Madlyn Fleming
6 years ago
Views:

1 Bioinforma)cs Resources - Swissprot - Lecture & Exercises Prof. B. Rost, Dr. L. Richter, J. Reeb Ins)tut für Informa)k I12

2 Puta)ve Schedule Apr. 22 nd Intro, General Overview (1. sh.) Apr. 29 th Sequence Databases (2. sh.) May 6 th No lecture May 13 th Sequence Databases (3. sh.) May 20 th Structure Databases (4. sh)* May 27 th SQL (5. sh.) Jun 3 rd SQL (6. sh) Jun 10 th No-SQL (7.sh.) Jun 17 th No-SQL (8.sh.)* Jun 24 th JavaScript / UI (9.sh.) Jul 1 st Web Services (10.sh.) Jul 8 th Bioinformatics Suites / Forums Jul 15 th Wrap Up, Q&A Jul 28 th Exam, 10:30-12:00 MW1050 * These exercises can earn you a bonus

3 XML Infusion (in 10 sec) compila)on from hmp:// XML is a soqware- and hardware-independent tool to store and to transport data XML stands for extensible Markup Language designed to store and transport data designed to be self-descrip)ve W3C recommenda)on it does NOT DO anything

4 About Tags XML tags are not predefined like HTML tags everybody can/has to invent his own tags new tags can be added any )me the author has to define content and structure of the document everything is plain text

5 Document Structure <?xml version="1.0" encoding="utf-8"?>! <bookstore>!! <book category="cooking">! <title lang="en">everyday Italian</title>! <author>giada De Laurentiis</author>! <year>2005</year>! <price>30.00</price>! </book>!! <book category="children">! <title lang="en">harry Potter</title>! <author>j K. Rowling</author>! <year>2005</year>! <price>29.99</price>! </book>!!...! </bookstore>!! taken from hmp://

6 Syntax Rules elements are defined using tags: <tagname>... </tagname> or <tagname/>! elements can be nested (contain other elements - parent and child nodes, sibling nodes) elements can have text content each document must contain ONE root element that is the parent of all other elements

7 Syntax Reﬁned prolog line <?xml...> is op)onal tags must be (self-) closed tag are case sensi)ve tags must be properly nested: <a><b>...</a></b> <a><b>...</b></a>! Wrong! Right!

8 Syntax Refined tags may have amributes amribute values must always be quoted some special characters cannot be used directly -> coded by en)ty references: < < less than > > greater than & & ampersand ' apostrophe " quota)on mark comments: <! >!

9 case sensi)ve Tag Names must start with a lemer or underscore must not start with the lemers xml in any case can contain: lemers, digits, hyphens, underscores and periods cannot contain spaces apply common sense and a consistent style avoid: minus (-), period (.), colon (:), non-english characters for compa)bility reasons

10 XML Element everything between the start and the end tag tags are included can contain: - text - amributes - other elements - a mix of all are extensible

11 XML AMributes values must be quoted: single or double quotes the unused character can be used inside the value decision for amribute or element undecided, but: - amributes cannot contain mul)ple values - amributes cannot contain tree structures - amributes are not easily expandable useful to store meta data, like element id, etc.

12 A Glimpse of Namespaces allow to prevent tag name collisions between different authors/applica)ons/domains implemented by the introduc)on of prefixes defined as an amribute: xmlns:prefix= URI! usage: <prefix:tagname>! the URI is only needed to be unique used to integrate other specifica)ons, e.g. XSLT

13 Levels of Correctness well formed: a document obey the syntax rules: - root element - closing tag - case sensi)ve - properly nested - amribute values quoted valid documents: in add)on to being valid the also conform to a document type defini)on (format specifica)on)

14 Document Type Defini)ons two ways to specify a document structure: DTD: Document Typ Defini)on XML Schema: XML based alterna)ve to DTD

15 Example <?xml version="1.0" encoding="utf-8"?> <!DOCTYPE note SYSTEM "Note.dtd > <note> <to>tove</to> <from>jani</from> <heading>reminder</heading> <body>don't forget me this weekend! &copyright; </body> </note>!

16 Example <!DOCTYPE note [ <!ELEMENT note (to,from,heading,body)> <!ELEMENT to (#PCDATA)> <!ELEMENT from (#PCDATA)> <!ELEMENT heading (#PCDATA)> <!ELEMENT body (#PCDATA)> <!ENTITY copyright Copyright by.. > ]>!

17 XML DTD referenced from a document with: <!DOCTYPE note SYSTEM "Note.dtd">!!DOCTYPE defines the root element!element defines the structure of the elements #PCDTA means parse-able text data!entity defines special characters or strings

18 XML Schema alterna)ve to DTD <xs:element name="note > <xs:complextype> <xs:sequence> <xs:element name="to" type="xs:string"/> <xs:element name="from" type="xs:string"/> <xs:element name="heading" type="xs:string"/> <xs:element name="body" type="xs:string"/> </xs:sequence> </xs:complextype> </xs:element>! support of data types and namespaces wrimen in XML and extensible!

19 Names and Other Complica)ons Amos Bairoch Ioannis Xenarios taken from hmp://web.expasy.org/ images/people/amos_bairoch.jpg taken from hmp:// Ioannis.Xenarios

20 History 1986 A. Bairoch created Swiss-Prot at the University of Geneva, since 1988 in collabora)on with EMBL/EBI 1993 together with Ron Appel launch of ExPASy 1998 Founda)on of SIB (Swiss Ins)tute of Bioinforma)cs) 2002 Founda)on of the UniProt consor)um by EBI, SIB and PIR

21 UniProtKB: UniProt Components: - UniProtKB/Swiss-Prot - UniProtKB/TrEMBL UniParc: pure sequence archive, no annota)ons UniRef: consists fo three databases of clustered sets of protein sequences (UniRef100, UniRef90, UniRef50) using the CD-HIT algorithm UniMes: data from metagenomic and environmental samples, not in UniProtKB

22 ExPASy hmp:// Expert Protein Analysis System (1993) now: SIB ExPASy Bioinforma)cs Resources Portal Ar)mo P, Jonnalagedda M, Arnold K, Bara)n D, Csardi G, de Castro E, Duvaud S, Flegel V, For)er A, Gasteiger E, Grosdidier A, Hernandez C, Ioannidis V, Kuznetsov D, Liech) R, Moreo S, Mostaguir K, Redaschi N, Rossier G, Xenarios I, and Stockinger H. ExPASy: SIB bioinforma9cs resource portal, Nucleic Acids Res, 40(W1):W597-W603, 2012.

23 Expasy Categories Proteomics Genomics Structural Bioinforma)cs Systems biology Phylogeny/evolu)on

24 Expasy Categories Popula)on gene)cs Transcriptomics Biophysics Imaging Drug Design

25 Resource Descrip)on 1. Resource name and descrip)on 2. Maintaining SIB group 3. Scien)fic category 4. Keywords: a controlled vocabulary is used to tag the resource

26 Resource Descrip)on 5. URL for the web interface and for the download if available 6. SoQware type: website, command line interface, GUI, etc 7. Status: green checkbox if currently available

27 UniProt/SwissProt Sta)s)cs Release 2016_05, May. 11 th taken from hmp://web.expasy.org/docs/ relnotes/relstat.html sequence entries ( in 2015_05)/ amino acids ( in 2015_05)

28 UniProt/SwissProt Sta)s)cs Growth over one year: 2016_5 vs 2015_5 Protein existence (PE) Entries % 1. Evidence at protein level (85.419) 16.8 (15.6) 2. Evidence at transcript level (61.814) 10.5 (11.3) 3. Inferred from homology ( ) 70.3 (70.7) 4. Predicted (11.526) 2.1 (2.1) 5. Uncertain (1.962) 0.4 (0.4)

29 Development taken from hmp://web.expasy.org/docs/relnotes/relstat1.png for release 2015_5

30 More Numbers (rel. 2015_5) Represented species: Top 20 species: sequences, i.e. 21.3% of the total number of sequences Entries No of Species Entries No of Species >

31 Species Representa)on (rel. 2015_5) Top Frequency Species Homo sapiens (Human) Mus musculus (Mouse) Arabidopsis thaliana (Mouse-ear cress) Rattus norvegicus (Rat) Saccharomyces cerevisiae (Baker s yest) Bos taurus (Bovine) Schizosaccheromyces pombe (Fission yeast) Escherichia coli K Bacillus subtilis Dictyostelium discoideum (Slime mold)

32 Representa)on of the Divisions (rel. 2015_5) Viruses (3%), Archaea (4%), Eukaryota (33%), Bacteria (61%),

33 Distribu)on of Eukaryota (rel. 2015_5) Nematoda (2%), 4417 Insecta (5%), 8781 Fungi (17%), Viridiplantae (20%), Other (8%), Human (11%), Other Mammalia (26%), Other Vertebrata (10%), 17823

34 Length Distribu)on (rel. 2015_5)

Amino Acid Composi)on (rel. 2015_5) figure taken from http://web.expasy.org/docs/relnotes/relstat.

35 Amino Acid Composi)on (rel. 2015_5) figure taken from gray=aliphatic, red=acidic, green=small hydroxy, blue=basic, black=aromatic, white=amide, yellow=sulfur

36 SwissProt Annota)on Process defined in hmp:// sop_manual_cura)on.pdf explained in hmp://

37 Annota)on Phases 1. Sequence cura)on 2. Sequence analysis 3. Literature cura)on 4. Family-based cura)on 5. Evidence amribu)on 6. Quality assurance, integra)on and update

38 Sequence Cura)on more than 95% are translated CDS from INSDC other sources: PDB, direct protein sequencing, projects not submiong to INSDC sequences are selected according to cura)on priori)es (hmp:// results in the canonical sequence for a gene/ species pair

39 Steps toward the canonical sequence Entry selec)on Run BLAST similarity searches to iden)fy addi)onal sequences for the same gene Iden)fy homologs by reciprocal BLAST and phylogeny based resources Lock selected entries for other curators to prevent duplica)on

40 Steps toward the canonical sequence Prepare sequence alignments with T-Coffee, Muscle, Clustal W Merge into the canonical sequence: - most prevalent - most similar to orthologs sequences found in other species - based on length and aa composi)on it allows the clearest descrip)on - default: longest record conflicts and varia)ons

41 Sequence Analysis Several analysis programs are applied to the sequences for: - topological features - post-transla)onal modifica)ons - domains all results are manually checked and in- or excluded for annota)on

42 Topological Analysis Tools Prediction Signal P Presence and location of signal peptides TargetP Presence and location of transit peptides Predotar Mitochondrial, plastid or ER targeting sequences ESKW Transmembrane domains MEMSAT Transmembrane domains TMHMM Transmembrane domains Phobius Discriminates transmembrane and signal regions

43 Post-transla)onal modiﬁca)on Analysis Tools Prediction GPI-predictor GPI lipid anchor sites NetNGlyc N-glycosylation sites NetOGlyc O-glycosylation sites NMT Predictor N-terminal myristoylation sites Sulfinator Tyrosine sulfatation sites

44 Domain Analysis Tools Prediction ps_scan internal PROSITE profile, pattern and rule scanning InterPro retrieves non-prosite motif matches using InterPro database or InterProScan Coils Coiled-coils regions polyaa internal program which identifies homopolymeric stretches of amino acids REPEAT identifies the following repeats: Ankyrin, Armadillo, HAT, HEAT, Kelch, Leucine-rich, PFTA, PFTB, RCC1, TPR, WD40

taken from http://www.uniprot.org/docs/sop_manual_curation.

45 taken from Figure 1. UniProtKB sequence analysis results displayed in graphical interface

46 Literature Cura)on Iden)fica)on of relevant scien)fic literature from - literature and text mining resources (PubMed, Europe PMC, ihop, TextPresso) - addi)ons from other sources made by the curator Informa)on is extracted form the full text: - general annota)ons (not posi)on specific) - posi)on specific annota)ons

47 General Annota)ons hmp:// general_annota)on posi)on-independent contains mostly general biological informa)on like: func)ons, cataly)c ac)vity, cofactor, enzyme regula)on, subunit structure, pathway,...

48 Sequence Annota)ons posi)on dependent hmp:// sequence_annota)on regions or sites of interest like post-transla)onal modifica)ons, binding sites, ac)ve sites, etc. contains several subsec)ons: molecule processing, regions, sites, amino acid modifica)ons, natural variants, experimental info, secondary structure

49 Family-based Cura)on Evalua)on and cura)on of homologs as described above Standardiza)on of annota)on of homologs Propaga)on of annota)on across the homologs to ensure consistency

50 Evidence AMribu)on Every annota)on is amributed to its original source Every annota)on can be traced back and evaluated For evidence dis)nc)on there are 7 codes from the Evidence Code Ontology (ECO) used for manually curated entries hmp:// Addi)onal GO term annota)on

51 ECO code Term name Usage ECO: experimental evidence used in manual assertion Information for which there is published experimental evidence ECO: non-traceable author statement used in manual assertion Information based on author statements in scientific articles for which there is no ECO: ECO: ECO: ECO: ECO: sequence similarity evidence used in manual assertion imported information used in manual assertion curator inference used in manual assertion match to sequence model evidence used in manual assertion combinatorial evidence used in manual assertion taken from experimental support Information which has been propagated from a related experimentally characterised protein Information which has been imported from another database and manually verified Information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article Information originating from the UniProt automatic annotation systems or any of the sequence analysis programs used during the manual curation process and which has been manually verified Information which is manually curated based on a combination of experimental and computational evidence

52 Quality Control and Integra)on Finished entries run through a series of rulebased checked concerning especially posi)ons and regions All errors are corrected Manually reviewed by a senior curator Finally it is integrated into the database Unlock the finished entries for further cura)on

53 Demostra)on hmp:// P62756#sec)on_features

54 The Swiss-Prot Flat File hmp://web.expasy.org/docs/userman.html An entry is composed by different line types Line types have their own format Follows EMBL Nucleo)de Sequence Database format as close as possible 2 sec)ons: - core data (sequence data, cita)on info, taxonomy) - annota)ons (func)on, modifica)on, domains, sec and quart structure, disease associa)ons, conflicts, asf)

55 The following table lists the available two-letter line codes. Each code is followed by three blanks. Line Code Content Occurence in an entry ID Identification Once; starts the entry AC Accession number(s) Once or more DT Date Three times DE Description Once or more GN Gene name(s) Optional OS Organism species Once or more OG Organelle Optional OC Organism classification Once or more OX Taxonomy cross-reference Once OH Organism host Optional --continued--

56 Line Code Content Occurence in an entry RN Reference number Once or more RP Reference position Once or more RC Reference comment(s) Optional RX Reference cross-reference(s) Optional RG Reference group Once or more (Optional if RA line) RA Reference authors Once or more (Optional if R line) RT Reference title Optional RL Reference location Once or more CC Comments or notes Optional DR Database cross-references Optional PE Protein existence Once KW Keywords Optional FT Feature table data Once or more in Swiss-Prot, optional in TrEMBL SQ Sequence header Once (blanks) Sequence data Once or more // Termination line Once; ends the entry

57 Fields in More Detail ID line: ID EntryName Status; SequenceLength. EntryName: up to 11 uppercase alphanumeric characters X_Y - X is a mnemonic code of at most 5 alphanumeric characters - Y is a mnemonic species iden)fica)on code of at most 5 alphanumeric characters ID CYC_BOVIN Reviewed; 104 AA.

58 AC line: AC AC_number_1;[ AC_number_2;]...[ AC_number_N;] Accession number: 6 or 10 characters [A-N,R-Z] [0-9] [A-Z] [A-Z, 0-9] [A-Z, 0-9] [0-9] [O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9] [A-N,R-Z] [0-9] [A-Z] [A-Z, 0-9] [A-Z, 0-9] [0-9] [A-Z] [A-Z,0-9] [A-Z,0-9] [0-9] RegEx: [OPQ][0-9][A-Z0-9]{3}[0-9] [A-NR-Z][0-9] ([A-Z][A-Z0-9]{2}[0-9]){1,2} Examples: P12345, Q1AAA9, A0A022YWF9

59 DT line: date, DD-MMM-YYYY always one of the biweekly release dates always three lines: - date of integra)on - date of sequence version, sequence version X - date of entry version, entry version X Example: DT 01-FEB-1999, integrated into UniProtKB/TrEMBL. DT 15-OCT-2000, sequence version 2. DT 15-DEC-2004, entry version 5.

60 DE lines: - three categories and addi)onal subcategories - contains a recommended name - besides: full name, short name, EC number - alterna)ve names: e.g. as an allergen or in biotechnology,...

61 DE RecName: Full=Annexin A5; DE Short=Annexin-5; DE AltName: Full=Annexin V; DE AltName: Full=Lipocor)n V; DE AltName: Full=Endonexin II; DE AltName: Full=Calphobindin I; DE AltName: Full=CBP-I; DE AltName: Full=Placental an)coagulant protein I; DE Short=PAP-I; DE AltName: Full=PP4; DE AltName: Full=Thromboplas)n inhibitor; DE AltName: Full=Vascular an)coagulant-alpha; DE Short=VAC-alpha; DE AltName: Full=Anchorin CII; DE RecName: Full=Granulocyte colony-s)mula)ng factor; DE Short=G-CSF; DE AltName: Full=Pluripoie)n; DE AltName: Full=Filgras)m; DE AltName: Full=Lenogras)m; DE Flags: Precursor;

62 OS line: origina)ng organism OS Homo sapiens (Human). OS Rous sarcoma virus (strain Schmidt-Ruppin A) (RSV-SRA) (Avian leukosis OS virus-rsa). OC lines: contain the taxonomic classifica)on of the source organism according to ( hmp:// OC Node[; Node...]. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; OC Homo.

63 RN, RP, RC, RX, RG, RA, RT, RL can occur mul)ple )me order in block fixed e.g: RN [1] RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORMS A AND C), FUNCTION, INTERACTION RP WITH PKC-3, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL RP STAGE, AND MUTAGENESIS OF PHE-175 AND PHE-221. RC STRAIN=Bristol N2; RX PubMed= ; DOI= /jbc.M ; RA Zhang L., Wu S.-L., Rubin C.S.; RT "A novel adapter protein employs a phosphotyrosine binding domain and RT excep)onally basic N-terminal domains to capture and localize an RT atypical protein kinase C: characteriza)on of Caenorhabdi)s elegans RT C kinase adapter 1, a protein that avidly binds protein kinase C3. ; RL J. Biol. Chem. 276: (2001).

64 CC lines free text contains most of the annotated informa)on CC -!- TOPIC: First line of a comment block; CC second and subsequent lines of a comment block. structured by predefined topics like: Allergen, Alterna)ve Products,.., Cofactor,..., Disease,.. Domain,..., Func)on, Interac)on,...

65 CC CC CC CC CC CC CC CC CC CC CC CC CC -!- ALLERGEN: Causes an allergic reaction in human. Minor allergen of! bovine dander.! -!- ALTERNATIVE PRODUCTS:! Event=Alternative initiation; Named isoforms=2;! Name=Alpha;! IsoId=P ; Sequence=Displayed;! Name=Beta;! IsoId=P ; Sequence=VSP_018696;! -!- SUBCELLULAR LOCATION: Cell membrane {ECO: }; Peripheral! membrane protein {ECO: }. Secreted {ECO: }. Note=The! last 22 C-terminal amino acids may participate in cell membrane! attachment.! -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm {ECO: }.!!!

66 Cross References too many to enumerate extensive references with nucleo)de databases, e.g.: in EMBL FT CDS FT /protein_id="caa FT /db_xref="swiss-prot:p26345 FT /gene="reca FT /product="reca protein in Swiss=Prot DR EMBL; AJ297977; CAC ; -; Genomic_DNA. DR EMBL; X56491; CAA ; ALT_FRAME; mrna.

67 Key Words / Feature Table KW Keyword[; Keyword...]. helps to search resp. index the database no limits: KW 3D-structure; Alterna)ve splicing; Alzheimer disease; Amyloid; KW Apoptosis; Cell adhesion; Coated pits; Copper; KW Direct protein sequencing; Disease muta)on; Endocytosis; KW Glycoprotein; Heparin-binding; Iron; Membrane; Metal-binding; KW Notch signaling pathway; Phosphoryla)on; Polymorphism; KW Protease inhibitor; Proteoglycan; Serine protease inhibitor; Signal; KW Transmembrane; Zinc. Feature table like GenBank/EMBL/DDBJ

68 Programma)c Access hmp:// programma)c_access (remember this link!) several use cases documented, but not as an API best way: use the web interface to construct/ refine your query first before you try to automate the process

69 Retrieving an Individual Entry uses simple URL which can be bookmarked for individual entries: hmp:// default result is a web page alterna)ve formats: txt, xml, rdf, fasta, gff specified via the accession suffix structured formats like xml or rdf can include referenced entries

70 Using the ID mapping service hmp:// programma)c_access#batch_retrieval_perl_exa mple uses hmp POST method converts between different database IDs you have to know the specific abbrevia)on for the respec)ve databases

71 Retrieving Entries via Queries uses hmp GET method i.e. the query string is part of the URL structure might be quite complex use the browser to configure the query string more seong are available via the query builder hmp:// the URL length might be limited to 1000 characters

72 Examples hmp:// hmp:// hmp:// UniRef90_P04259.xml hmp:// UniRef90_P04259.rdf hmp:// UniRef90_P04259.fasta hmp:// UniRef90_P04259.tab

Bioinforma)cs Resources XML / Web Access

Bioinforma)cs Resources XML / Web Access Lecture & Exercises Prof. B. Rost, Dr. L. Richter, J. Reeb Ins)tut für Informa)k I12 XML Infusion (in 10 sec) compila)on from hkp://www.w3schools.com/xml/default.asp