BioPerl. General capabilities (packages)

Jane Neal

1 General capabilities (packages)
  Sequences: fetching, reading, writing, reformatting, annotating, groups
  Access to remote databases
  Applications: BLAST, Blat, FASTA, HMMer, Clustal, Alignment, many others
  Gene modeling: Genscan, Sim4, Grail, Genemark, ESTScan, MZEF, EPCR
  XML formats: GAME, BSML and AGAVE
  GFF
  Trees
  Genetic maps
  3D structure
  Literature
  Graphics
Biol Practical Biocomputing

2 Auxiliary packages
  Possibly of less general interest; require additional modules
  BioPerl-run: running applications
    EMBOSS
    PISE
  Bioperl-ext: extensions
  Bioperl-db and BioSQL

3 Simple use: Bio::Perl
  Easy access to a small part of Bioperl's functionality in an easy-to-use manner

    use Bio::Perl;

    # this script will only work if you have an internet connection on the
    # computer you're using; the databases you can get sequences from
    # are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'
    my $seq_object = get_sequence('swiss', "roa1_human");
    write_sequence(">roa1.fasta", 'fasta', $seq_object);

    use Bio::Perl;

    my $seq = get_sequence('swiss', "roa1_human");

    # uses the default database - nr in this case
    my $blast_result = blast_sequence($seq);
    write_blast(">roa1.blast", $blast_result);

4 Bio::Perl
  Bio::Perl has a number of other easy-to-use functions, including
    get_sequence - gets a sequence from standard, internet-accessible databases
    read_sequence - reads a sequence from a file
    read_all_sequences - reads all sequences from a file
    new_sequence - makes a Bioperl sequence just from a string
    write_sequence - writes a single sequence or an array of sequences to a file
    translate - provides a translation of a sequence
    translate_as_string - provides a translation of a sequence, returning just the sequence as a string
    blast_sequence - BLASTs a sequence against standard databases at NCBI
    write_blast - writes a BLAST report out to a file
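
The functions that do not touch the network can be tried offline. The following is a minimal sketch, assuming BioPerl (and therefore Bio::Perl) is installed; the sequence string and the identifier demo_seq are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Perl;

# new_sequence builds a Bioperl sequence object from a plain string;
# both the sequence and the identifier are hypothetical
my $seq = new_sequence("ATGGCCAAATGA", "demo_seq");

# translate_as_string returns the protein translation as a plain string
my $protein = translate_as_string($seq);

print "id:      ", $seq->display_id, "\n";
print "dna:     ", $seq->seq, "\n";
print "protein: $protein\n";
```

Because new_sequence returns an ordinary sequence object, the result can also be passed to write_sequence in the same way as a sequence fetched with get_sequence.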

5 Sequence Objects
  Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI
  Common formats are interpreted automatically
    Simple formats (without features): FASTA (Pearson), Raw, GCG
    Rich formats (with features and annotations): GenBank, EMBL, Swissprot, GenPept
    XML: BSML, GAME, AGAVE, TIGRXML, CHADO

6 Sequences, Features and Annotations
  Sequence: DNA, RNA, amino acid
    Sequences are feature containers
  Feature: information with a sequence location
  Annotation: information without an explicit sequence location
  Parsing sequences
    Bio::SeqIO for automatically reading most types
    multiple drivers: genbank, embl, fasta, ...
  Sequence objects
    Bio::PrimarySeq
    Bio::Seq
    Bio::Seq::RichSeq

7 Simple examples

    #!/bin/perl -w
    use Bio::Seq;

    $seq_obj = Bio::Seq->new(
        -seq      => "aaaatgggggggggggccccgtt",
        -alphabet => 'dna'
    );

    #!/bin/perl -w
    use Bio::Seq;

    $seq_obj = Bio::Seq->new(
        -seq        => "aaaatgggggggggggccccgtt",
        -display_id => "#12345",
        -desc       => "example 1",
        -alphabet   => "dna"
    );
    print $seq_obj->seq();

8 Reading sequences from files & databases

    #!/bin/perl -w
    use Bio::SeqIO;

    # no leading '>' on the file name: '>' would open the file for writing
    $seqio_obj = Bio::SeqIO->new(
        -file   => 'sequence.fasta',
        -format => 'fasta'
    );

    # if there is more than one sequence in the file
    while ( $seq_obj = $seqio_obj->next_seq ) {
        # print the sequence
        print $seq_obj->seq, "\n";
    }

    #!/bin/perl -w
    use Bio::DB::GenBank;

    $db_obj  = Bio::DB::GenBank->new;
    $seq_obj = $db_obj->get_Seq_by_id('AE');    # identifier is truncated in the source

9 Getting sequences directly from database

    #!/bin/perl -w
    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;
    # also Bio::DB::GenPept, Bio::DB::SwissProt, Bio::DB::RefSeq and Bio::DB::EMBL

    # keyword query
    $query_obj = Bio::DB::Query::GenBank->new(
        -query => 'gbdiv est[prop] AND Trypanosoma brucei [organism]',
        -db    => 'nucleotide'
    );

    $gb = Bio::DB::GenBank->new;

    # this returns a Seq object:
    $seq1 = $gb->get_Seq_by_id('musighba1');

    # this also returns a Seq object:
    $seq2 = $gb->get_Seq_by_acc('af303112');

    # this returns a SeqIO object, which can be used to get a Seq object
    # (the third identifier is missing in the source):
    $seqio = $gb->get_Stream_by_id(["j00522", "af303112", " "]);
    $seq3  = $seqio->next_seq;

10 Getting more sequence information
  Some methods
    accession_number() - get the accession number
    display_id() - get identifier string
    description() or desc() - get description string
    seq() - get the sequence as a string
    length() - get the sequence length
    subseq($start, $end) - get a subsequence (char string)
    translate() - translate to protein (seq obj)
    revcom() - reverse complement (seq obj)
    species() - returns a Bio::Species object

    #!/usr/bin/env perl
    use strict;
    use Bio::SeqIO;
    use Bio::DB::GenBank;

    my $genbank = Bio::DB::GenBank->new;
    my $seq  = $genbank->get_Seq_by_acc('af060485');   # get a record by accession
    my $dna  = $seq->seq;                # get the sequence as a string
    my $id   = $seq->display_id;         # identifier
    my $acc  = $seq->accession_number;   # accession number
    my $desc = $seq->desc;               # get the description

    print "ID: $id\naccession: $acc\ndescription: $desc\n$dna\n";
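
The methods that return strings (seq, subseq) and those that return sequence objects (translate, revcom) are easy to mix up. The following is a small offline sketch, assuming BioPerl is installed; the sequence and the identifier are made up and no database is contacted.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Seq;

# hypothetical in-memory sequence
my $seq = Bio::Seq->new(
    -seq        => 'ATGGCCAAATGA',
    -display_id => 'demo_seq',
    -desc       => 'made-up example',
    -alphabet   => 'dna',
);

my $length  = $seq->length;           # number of residues
my $first3  = $seq->subseq(1, 3);     # plain string; coordinates are 1-based and inclusive
my $revcom  = $seq->revcom->seq;      # revcom returns an object, so ask it for its string
my $protein = $seq->translate->seq;   # translate also returns an object

print "length:  $length\n";
print "subseq:  $first3\n";
print "revcom:  $revcom\n";
print "protein: $protein\n";
```

Calling seq on the objects returned by revcom and translate is what turns them back into plain strings for printing.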

11 Sequence Objects

LOCUS       ECORHO       1880 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  E.coli rho gene coding for transcription termination factor.
ACCESSION   J01673 J01674
VERSION     J GI:
KEYWORDS    attenuator; leader peptide; rho gene; transcription terminator.
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1880)
  AUTHORS   Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
  TITLE     Localization and regulation of the structural gene for
            transcription-termination factor rho of Escherichia coli
  JOURNAL   J. Mol. Biol. 162 (2), (1982)
  MEDLINE
  PUBMED
REFERENCE   2  (bases 1 to 1880)
  AUTHORS   Pinkham,J.L. and Platt,T.
  TITLE     The nucleotide sequence of the rho gene of E. coli K-12
COMMENT     Original source text: Escherichia coli (strain K-12) DNA.
            A clean copy of the sequence for [2] was kindly provided by
            J.L.Pinkham and T.Platt.
FEATURES             Location/Qualifiers
     source
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /db_xref="taxon:562"
     mRNA            212..>1880
                     /product="rho mRNA"
     gene
                     /gene="rho"
     CDS
                     /gene="rho"
                     /note="transcription termination factor"
                     /codon_start=1
                     /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK...
                     IDAMEFLINKLAMTKTNDDFFEMMKRS"
ORIGIN      15 bp upstream from HhaI site.
        1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
       61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc
          ...deleted
          tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
     1861 ggttaatttt tgcacaggac
//

12 Bio::Seq object methods
  add_SeqFeature($feature) - attach feature(s)
  get_SeqFeatures() - get all the attached features
  species() - a Bio::Species object
  annotation() - Bio::Annotation::Collection
  Features
    Bio::SeqFeatureI - interface
    Bio::SeqFeature::Generic - basic implementation
    SeqFeature::Similarity - some score info
    SeqFeature::FeaturePair - pair of features

13 Sequence Features
  Bio::SeqFeatureI - interface - GFF derived
    start(), end(), strand() for location information
    location() - Bio::LocationI object (to represent complex locations)
    score, frame, primary_tag, source_tag - feature information
    spliced_seq() - for an attached sequence, get the sequence spliced
  Bio::SeqFeature::Generic
    add_tag_value($tag, $value) - add a tag/value pair
    get_tag_values($tag) - get all the values for this tag
    has_tag($tag) - test if a tag exists
    get_all_tags() - get all the tags
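
The feature methods above can be exercised without a GenBank file. The following is a minimal sketch, assuming BioPerl is installed; the sequence, the coordinates and the tag values are all made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# hypothetical 40 bp sequence to act as the feature container
my $seq = Bio::Seq->new(
    -seq        => 'ACGT' x 10,
    -display_id => 'demo_seq',
    -alphabet   => 'dna',
);

# a feature is information tied to a location on the sequence
my $feat = Bio::SeqFeature::Generic->new(
    -start       => 5,
    -end         => 20,
    -strand      => 1,
    -primary_tag => 'gene',
    -source_tag  => 'example',
    -tag         => { gene => 'demo' },
);
$feat->add_tag_value( 'note', 'hypothetical feature' );

# attaching the feature lets it look up its own subsequence
$seq->add_SeqFeature($feat);

for my $f ( $seq->get_SeqFeatures ) {
    print $f->primary_tag, " ", $f->start, "..", $f->end,
          " strand ", $f->strand, "\n";
    for my $tag ( $f->get_all_tags ) {
        print "  $tag = ", join( ", ", $f->get_tag_values($tag) ), "\n";
    }
    print "  sequence: ", $f->spliced_seq->seq, "\n";
}
```

The same loop works on features parsed from a rich format such as GenBank, since the parser attaches them to the sequence object in the same way.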

14 Sequence Annotations
  Each Bio::Seq has a Bio::Annotation::Collection via $seq->annotation()
  Annotations are stored with keys like 'comment'
    $annotation->get_Annotations('comment')
    $annotation->add_Annotation('comment', $an)
  Annotation::Comment - comment field
  Annotation::Reference - author, journal, title, etc.
  Annotation::DBLink - database, primary_id, optional_id, comment
  Annotation::SimpleValue
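
Storing and retrieving annotations by key can be sketched as follows, assuming BioPerl is installed. The sequence, the comment text and the 'note' key are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Annotation::Comment;
use Bio::Annotation::SimpleValue;

# hypothetical sequence; annotation() hands back its Bio::Annotation::Collection
my $seq = Bio::Seq->new(
    -seq        => 'ACGTACGT',
    -display_id => 'demo_seq',
    -alphabet   => 'dna',
);
my $collection = $seq->annotation;

# annotations are stored under a key and carry no sequence location
$collection->add_Annotation( 'comment',
    Bio::Annotation::Comment->new( -text => 'a made-up comment' ) );
$collection->add_Annotation( 'note',
    Bio::Annotation::SimpleValue->new( -value => 'a made-up value' ) );

# get_Annotations returns every annotation stored under the given key
my ($comment) = $collection->get_Annotations('comment');
my ($note)    = $collection->get_Annotations('note');

print "comment: ", $comment->text, "\n";
print "note:    ", $note->value,  "\n";
```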

15 Sequences, Features, and Annotations
  Features:    Bio::Seq has-a Bio::SeqFeature::Generic, which has-a Bio::LocationI
  Annotations: Bio::Seq has-a Bio::Annotation::Collection, which has-a Bio::Annotation::Comment

16 Writing sequences
  Writing in a different format than was read = reformatting

    use Bio::SeqIO;

    # convert swissprot to fasta format
    my $in  = Bio::SeqIO->new(-format => 'swiss', -file => 'file.sp');
    my $out = Bio::SeqIO->new(-format => 'fasta', -file => '>file.fa');

    while ( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }

17 Remote Blast
  Retrieve sequence, setup and submit

    use Bio::DB::GenBank;
    use Bio::Tools::Run::RemoteBlast;

    # retrieve sequence from genbank (the accession is missing in the source)
    my $db_obj  = Bio::DB::GenBank->new;
    my $seq_obj = $db_obj->get_Seq_by_acc(' ');
    my $seq     = $seq_obj->seq;
    print "seq:$seq\n";

    # remote BLAST setup and query submission
    my $v = 1;    # turn on verbose output
    my $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog'   => 'blastp',
        '-data'   => 'swissprot',
        '-expect' => '1e-10'
    );
    my $r = $remote_blast->submit_blast($seq_obj);
    print STDERR "waiting " if ( $v > 0 );

18 Remote Blast
  Retrieve sequence, setup and submit (output)

    WARNING
    MSG: Unrecognized DBSOURCE data: pdb: molecule 2NLL, chain 65, release Aug 27, 2007;
      deposition: Nov 20, 1996; class: TranscriptionDNA; source:
      Mol_id: 1; Organism_scientific: Homo Sapiens; Organism_common: Human;
      Genus: Homo; Species: Sapiens; Expression_system: Escherichia Coli;
      Expression_system_common: Bacteria; Expression_system_genus: Escherichia;
      Expression_system_species: Coli;
      Mol_id: 2; Organism_scientific: Homo Sapiens; Organism_common: Human;
      Genus: Homo; Species: Sapiens; Expression_system: Escherichia Coli;
      Expression_system_common: Bacteria; Expression_system_genus: Escherichia;
      Expression_system_species: Coli;
      Mol_id: 3; Synthetic: Yes;
      Mol_id: 4; Synthetic: Yes;
      Exp. method: X-Ray Diffraction
    seq:caicgdrssgkhygvyscegckgffkrtvrkdltytcrdnkdclidkrqrnrcqycryqkclamgm

19 Remote Blast Results

    # the list of search RIDs is stored in the RemoteBlast object
    while ( my @rids = $remote_blast->each_rid ) {
        foreach my $rid (@rids) {
            # Try to retrieve a search; $rc is not a reference until the search is done.
            # When the search is complete, $rc is a Bio::SearchIO object.
            my $rc = $remote_blast->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # if the search is not done, wait 5 sec and try again
                # it would be a good idea to put a maximum limit here so the script
                # doesn't run forever in the event of an error
                if ( $rc < 0 ) {
                    $remote_blast->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                # search result successfully retrieved
                my $result = $rc->next_result();    # see Bio::Search::Result

                # save the output
                my $filename = $result->query_name() . "\.out";
                $remote_blast->save_output($filename);
                $remote_blast->remove_rid($rid);

                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    # see Bio::Search::Hit::HitI
                    next unless ( $v > 0 );
                    print "\thit name is ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

20 Remote Blast

    waiting...
    Query Name: 2NLL_A
        hit name is sp P RXRA_MOUSE
            score is 275
        hit name is sp P RXRA_HUMAN
            score is 275
        hit name is sp Q RXRA_RAT
            score is 275
        hit name is sp Q RXRAB_DANRE
            score is 273
        hit name is sp A2T929.2 RXRAA_DANRE
            score is 272
        hit name is sp Q7SYN5.1 RXRBA_DANRE
            score is 270
        hit name is sp Q RXRGA_DANRE
            score is 268
        hit name is sp P RXRA_XENLA
            score is 268
        hit name is sp P RXRG_CHICK
            score is 268
        hit name is sp Q RXRBB_DANRE
            score is 266
        hit name is sp Q0GFF6.2 RXRG_PIG
            score is 264
        hit name is sp Q0VC20.1 RXRG_BOVIN
            score is 264
        hit name is sp Q5BJR8.1 RXRG_RAT
            score is 264
        hit name is sp Q5REL6.1 RXRG_PONAB
            score is 264
        hit name is sp P RXRG_XENLA
            score is 264
        hit name is sp P RXRG_HUMAN
            score is 264
        hit name is sp P RXRG_MOUSE
            score is 264
        hit name is sp Q6DHP9.1 RXRGB_DANRE
            score is 261
        hit name is sp Q5TJF7.1 RXRB_CANFA
            score is 258
        hit name is sp Q505F1.2 NR2C1_MOUSE
            score is 200
        hit name is sp Q9TTR7.1 COT2_BOVIN
            score is 200
        hit name is sp Q COT2_CHICK
            score is 200
        hit name is sp P UP1_DROME
            score is 200
        hit name is sp P UP2_DROME
        hit name is sp P COT2_HUMAN
        hit name is sp O COT2_RAT
        hit name is sp A0JNE3.1 NR2C1_BOVIN
        hit name is sp Q6PH18.1 N2F1B_DANRE
        hit name is sp Q9N4B8.4 NHR41_CAEEL
        hit name is sp O NHR49_CAEEL
        hit name is sp P HNF4_DROME
        hit name is sp P HNF4B_XENLA

21 Database Search
  BLAST - 3 components
    Result: Bio::Search::Result::ResultI
    Hit:    Bio::Search::Hit::HitI
    HSP:    Bio::Search::HSP::HSPI

22 Blast

    use Bio::Perl;

    my $seq = get_sequence('swiss', "roa1_human");

    # uses the default database - nr in this case
    my $blast_result = blast_sequence($seq);
    write_blast(">roa1.blast", $blast_result);

    use Bio::SearchIO;

    # parse a saved report and print the HSPs above 75% identity
    $report_obj = Bio::SearchIO->new(-format => 'blast', -file => 'report.bls');
    while ( $result = $report_obj->next_result ) {
        while ( $hit = $result->next_hit ) {
            while ( $hsp = $hit->next_hsp ) {
                if ( $hsp->percent_identity > 75 ) {
                    print "Hit\t", $hit->name, "\n",
                          "Length\t", $hsp->length('total'), "\n",
                          "Percent_id\t", $hsp->percent_identity, "\n";
                }
            }
        }
    }

23 BLAST
  Processed result

    Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
    Matrix was BLOSUM62
    Hit is F35H10.10
      HSP Len is 315 E-value is 4.9e-11 Bit score 182
      Query loc: Sbject loc:
      HSP Len is 28 E-value is 1.4e-09 Bit score 39
      Query loc: Sbject loc:

24 BLAST
  Using the Search::Hit object

    use Bio::SearchIO;
    use strict;

    my $parser = Bio::SearchIO->new(-format => 'blast', -file => 'file.bls');
    while ( my $result = $parser->next_result ) {
        while ( my $hit = $result->next_hit ) {
            print "hit name=", $hit->name, " desc=", $hit->description,
                  "\n len=", $hit->length, " acc=", $hit->accession, "\n";
            print "raw score ", $hit->raw_score, " bits ", $hit->bits,
                  " significance/evalue=", $hit->evalue, "\n";
        }
    }

25 Search::Hit methods
  start(), end() - get overall alignment start and end for all HSPs
  strand() - get best overall alignment strand
  matches() - get total number of matches across the entire set of HSPs
    can specify only exact ('id') or conservative ('cons') matches

26 Using Search::HSP

    use Bio::SearchIO;
    use strict;

    my $parser = Bio::SearchIO->new(-format => 'blast', -file => 'file.bls');
    while ( my $result = $parser->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                print "hsp evalue=", $hsp->evalue, " score=", $hsp->score, "\n";
                print "total length=", $hsp->hsp_length,
                      " qlen=", $hsp->query->length,
                      " hlen=", $hsp->hit->length, "\n";
                print "qstart=", $hsp->query->start, " qend=", $hsp->query->end,
                      " qstrand=", $hsp->query->strand, "\n";
                print "hstart=", $hsp->hit->start, " hend=", $hsp->hit->end,
                      " hstrand=", $hsp->hit->strand, "\n";
                print "percent identical ", $hsp->percent_identity,
                      " frac conserved ", $hsp->frac_conserved(), "\n";
                print "num query gaps ", $hsp->gaps('query'), "\n";
                print "hit str =", $hsp->hit_string, "\n";
                print "query str =", $hsp->query_string, "\n";
                print "homolog str=", $hsp->homology_string, "\n";
            }
        }
    }

27 Search::HSP methods
  rank() - order in the alignment by score, size
  matches()
  seq_inds() - residue positions that are conserved, identical, mismatches, gaps

28 SearchIO
  SearchIO objects correspond to many result formats
    BLAST (WU-BLAST, NCBI, XML, PSIBLAST, BL2SEQ, MEGABLAST, TABULAR (-m8/m9))
    FASTA (m9 and m0)
    HMMER (hmmpfam, hmmsearch)
    UCSC formats (WABA, AXT, PSL)
    Gene-based alignments: Exonerate, SIM4, {Gene,Genome}wise
  Can write searches in alternative formats

29 Sequence Alignment
  Bio::AlignIO to read alignment files
    Produces Bio::SimpleAlign objects
    Phylip
    Clustal
  Interface and objects designed for round-tripping and some functional work
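
Reading an alignment in one format and writing it in another follows the same pattern as Bio::SeqIO. The following is a minimal sketch, assuming BioPerl is installed; the two aligned sequences are made up, and aligned FASTA is used as the input format only because it is easy to embed in a script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::AlignIO;

# a tiny made-up alignment in aligned-FASTA format
my $fasta = <<'END';
>seq1
ACGT-ACGTA
>seq2
ACGTTACG-A
END

# read from an in-memory filehandle instead of a file on disk
open my $in_fh, '<', \$fasta or die "cannot open in-memory handle: $!";
my $in  = Bio::AlignIO->new( -fh => $in_fh,    -format => 'fasta' );
my $out = Bio::AlignIO->new( -fh => \*STDOUT,  -format => 'clustalw' );

# next_aln returns a Bio::SimpleAlign object
my $aln  = $in->next_aln;
my @seqs = $aln->each_seq;
print STDERR "sequences: ", scalar(@seqs), ", columns: ", $aln->length, "\n";

# writing with a different format than was read = reformatting
$out->write_aln($aln);
```

Replacing the -fh arguments with -file arguments gives the file-to-file conversion, in the same style as the Bio::SeqIO reformatting example.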

30 Graphics

    use Bio::Graphics;
    use Bio::SeqIO;
    use Bio::SeqFeature::Generic;

    my $file = shift or die "provide a sequence file as the argument";
    my $io   = Bio::SeqIO->new(-file => $file) or die "couldn't create Bio::SeqIO";
    my $seq  = $io->next_seq or die "couldn't find a sequence in the file";
    my @features = $seq->get_all_SeqFeatures;

    # sort features by their primary tags
    my %sorted_features;
    for my $f (@features) {
        my $tag = $f->primary_tag;
        push @{ $sorted_features{$tag} }, $f;
    }

    my $panel = Bio::Graphics::Panel->new(
        -length    => $seq->length,
        -key_style => 'between',
        -width     => 800,
        -pad_left  => 10,
        -pad_right => 10
    );
    $panel->add_track(
        arrow => Bio::SeqFeature::Generic->new(-start => 1, -end => $seq->length),
        -bump => 0, -double => 1, -tick => 2
    );
    $panel->add_track(
        generic => Bio::SeqFeature::Generic->new(-start => 1, -end => $seq->length),
        -bgcolor => 'blue', -label => 1
    );

    # general case
    my @colors = qw(cyan orange blue purple green chartreuse magenta yellow aqua);
    my $idx = 0;
    for my $tag (sort keys %sorted_features) {
        my $features = $sorted_features{$tag};
        $panel->add_track(
            $features,
            -glyph       => 'generic',
            -bgcolor     => $colors[ $idx++ % @colors ],
            -fgcolor     => 'black',
            -font2color  => 'red',
            -key         => "${tag}s",
            -bump        => +1,
            -height      => 8,
            -label       => 1,
            -description => 1,
        );
    }

    print $panel->png;

31 Graphics

How can we use hashes to count?

    #!/usr/bin/perl -w
    use strict;

    my @strings = qw/ a a b c a b c d e a b c a b c d a a a a a a a b/;
    my %count;
    foreach ( @strings ) {
        $count{$_}++;
    }
    # Then, if you want
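
The source breaks off at this point. One possible continuation, offered only as a sketch and not as the original author's code, is to report the counts sorted from most to least frequent; the tie-breaking rule (alphabetical) is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @strings = qw/ a a b c a b c d e a b c a b c d a a a a a a a b /;

# each distinct string becomes a hash key; its value is the running count
my %count;
$count{$_}++ foreach @strings;

# most frequent first; ties are broken alphabetically (an assumed convention)
my @ordered = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

foreach my $key (@ordered) {
    print "$key\t$count{$key}\n";
}
```

With the list above this prints a (12), b (5), c (4), d (2) and e (1), one per line.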