How can we use hashes to count?


Elizabeth Perkins

1 How can we use hashes to count?

2
    #!/usr/bin/perl -w
    use strict;

    my @letters = qw/ a a b c a b c d e a b c a b c d a a a a a a a b /;
    my %count;

    # Each element becomes a hash key; its value is the number of occurrences
    foreach (@letters) {
        $count{$_}++;
    }

    # Then, if you want to sort the results by value:
    foreach ( sort { $count{$a} <=> $count{$b} } keys %count ) {
        print "$_ => $count{$_}\n";
    }

3 Exercise
Write a script that counts all words in a text file.
Text:
    This is is an example
Result:
    This => 1
    example => 1
    an => 1
    is => 2

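4
    #!/usr/bin/perl -w
    use strict;

    my %count = ();
    while (<>) {
        chomp;                   # remove the trailing newline
        s/[.,:;!(){}]//g;        # strip punctuation
        my @words = split;       # split the line on whitespace
        foreach (@words) {
            $count{$_}++;        # a missing key starts at 0 automatically
        }
    }

    foreach ( sort { $count{$a} <=> $count{$b} } keys %count ) {
        print "$_ => $count{$_}\n";
    }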
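The same counting idea can be tried without a text file. The following is a minimal sketch, assuming a hypothetical helper named count_words (not part of the slides) that takes lines of text and returns a reference to the count hash. Ties in the count are broken alphabetically so that the output order is reproducible.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each whitespace-separated word
# occurs in the given lines and return a reference to the count hash.
sub count_words {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $line =~ s/[.,:;!(){}]//g;          # strip simple punctuation
        $count{$_}++ for split ' ', $line;  # one increment per word
    }
    return \%count;
}

my $count = count_words("This is is an example");

# Sort by count first, then by word, so equal counts have a fixed order.
for my $word ( sort { $count->{$a} <=> $count->{$b} or $a cmp $b } keys %$count ) {
    print "$word => $count->{$word}\n";
}
```

Returning a hash reference keeps the counting logic reusable; the caller decides how to sort and print.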

5 Packages
Collect related code together
    logical separation of code
Create a namespace within a program
    identifies data and subroutines
    prevents name collisions

6 Creating a Package
Declare the package using the package keyword.
Write code: any code that follows becomes part of the namespace of the most recently declared package.

    package mypackage;

    sub mysub1 {
        ## ...
    }

7 Using Packages
Access subroutines by using the full name: package::subroutine
Variable prefixes still apply: a scalar in a package is written $package::variable.

8 What's in a Name?
Each package maintains its own namespace.
Duplicate names within a single namespace are not allowed.
The package statement switches the current namespace.
It is much cleaner to remain in the main namespace and use fully qualified names than to jump between packages.
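A minimal sketch of fully qualified names, assuming a hypothetical package called Greeter (the name, its variable and its subroutine are made up for illustration). The main program never switches namespaces; it reaches into the package by name.

```perl
use strict;
use warnings;

package Greeter;

our $greeting = "Hello";     # package variable: $Greeter::greeting

sub greet {
    my ($name) = @_;
    return "$greeting, $name!";
}

package main;

# Stay in main and use the fully qualified names.
print Greeter::greet("world"), "\n";
print "$Greeter::greeting\n";    # the $ prefix goes before the package name
```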

9 Variable Types
1. package variables: regular generic variables
    accessible through the fully qualified name
    analogous to public variables
2. lexical variables: declared with my
    accessible only within the enclosing code block
    NOT analogous to private variables

10 Default Declarations
All variables are assumed to be package variables and can be seen throughout the package, unless otherwise directed...

    package test;

    sub print {
        $i = 1;
        checkcount();
        print $i;    ## what does this print?
    }

    sub checkcount {
        for ($i = 0; $i < 10; $i++) {
            # do something
        }
    }

11 Lexical Variables are Better
Declaring the variables in our package with my helps to prevent name conflicts within the package:

    package test;

    sub print {
        my $i = 1;
        checkcount();
        print $i;    ## what does this print?
    }

    sub checkcount {
        for (my $i = 0; $i < 10; $i++) {
            # do something
        }
    }
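The two variants can be compared side by side. This is a sketch under stated assumptions: the package name Demo and all subroutine names are hypothetical, and the package variable is declared with our so the script also runs under use strict.

```perl
use strict;
use warnings;

package Demo;

our $i;    # package variable, shared by every subroutine in Demo

sub with_package_var {
    $i = 1;
    clobber_package_var();    # the loop below overwrites the shared $i
    return $i;
}

sub clobber_package_var {
    for ($i = 0; $i < 10; $i++) { }
}

sub with_lexical_var {
    my $i = 1;                # private to this subroutine
    count_lexically();        # uses its own, separate $i
    return $i;
}

sub count_lexically {
    for (my $i = 0; $i < 10; $i++) { }
}

package main;

print "package variable: ", Demo::with_package_var(), "\n";    # prints 10
print "lexical variable: ", Demo::with_lexical_var(), "\n";    # prints 1
```

With the package variable, the loop in the helper leaves the shared $i at 10; with my, each subroutine keeps its own copy and the caller still sees 1.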

12 Modules
A module is a text file containing Perl code.
Modules are Perl's mechanism for creating reusable code libraries.
The file is placed in a specific directory hierarchy.
The file is named with a .pm extension.
Code within a module is brought into a program with the use statement.

13 Modules and Packages
Packages are often organized into modules, with one package per module.
Thus, when we use a module, we are bringing a package into our namespace.
To make our life easier, we usually give the package the same name as the module.

14 Example: A Package as a Module
file bioperl/sequence.pm:

    package bioperl::sequence;
    ## bioperl sequence routines...
    1;

file main.pl:

    use bioperl::sequence;
    ## bioperl sequence package is now
    ## imported into the main program

15 An Oddity
All modules must end with a TRUE value or they will cause an error when they are loaded.
Typically, since modules usually contain code blocks and the last line is either blank or a }, most programmers place a bare 1; at the end of their module to satisfy Perl's insecurities.
Explanation: Like a subroutine, a module returns the value of its last evaluated statement, and that value is used as the return value for the use directive. Thus, we have to ensure that the last statement yields a true value rather than a false or empty one (as happens with a blank line or a bare closing bracket).
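The rule can be checked with two throwaway modules. The sketch below assumes hypothetical module names Good.pm and Bad.pm, writes them to a temporary directory, and loads them with require (which use relies on internally). The second module deliberately ends with an explicit false value.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Hypothetical helper: write a module file and return its full path.
sub write_module {
    my ($name, $body) = @_;
    my $path = "$dir/$name";
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $body;
    close $fh;
    return $path;
}

my $good = write_module('Good.pm', "package Good;\nsub hello { 'hi' }\n1;\n");
my $bad  = write_module('Bad.pm',  "package Bad;\nsub hello { 'hi' }\n0;\n");

# require dies if the file's last statement is false, so trap it with eval.
my $good_ok = eval { require $good; 1 } ? 1 : 0;
my $bad_ok  = eval { require $bad;  1 } ? 1 : 0;
my $bad_err = $@;

print "Good.pm loaded: $good_ok\n";
print "Bad.pm loaded:  $bad_ok\n";
print "Error: $bad_err" unless $bad_ok;
```

Only the module that ends in 1; loads; for the other one, Perl reports that the file did not return a true value.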

16 Making modules with h2xs
1. Go to the directory in which you want to create the module.
2. Enter something like: h2xs -XA -n MyFirstModule
3. A module template MyFirstModule.pm is created, among other files: let us have a look at it.

17 Exercise Create a subroutine hello in MyFirstModule.pm Write a script to test this subroutine within MyFirstModule

18 BioPerl BioPerl is a collection of modules that facilitates the development of Perl scripts for bioinformatics applications.

19 BioPerl
Objective of BioPerl: Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data.
Background:
    Started in 1995
    One of the oldest open source bioinformatics toolkit projects

20 Why Perl?
Most of the primary biological data is still text.
Perl has very powerful regular expression matching and string manipulation operators.
Easy Web CGI scripting (see lecture 5)

21 What is BioPerl?
Object oriented: core objects (sequences, structures).
A reusable collection of Perl modules that facilitate bioinformatics application development:
    Sequence manipulation.
    Accessing databases with different formats (GenBank, PDB).
    Execution and parsing of the results of molecular biology programs (BLAST, ClustalW).

22 Download and Install Bioperl

24 What does the code look like?

    #!/usr/local/bin/perl
    # Perform various calculations on a sequence
    use Bio::Seq;

    my $seq = Bio::Seq->new(
        -seq              => 'ATGGGGGTGGTGGTACCCT',
        -id               => 'human_id',
        -accession_number => 'AL000012',
    );

    print $seq->seq(), "\n";               # print the sequence
    print $seq->revcom->seq(), "\n";       # print the reverse complement
    print $seq->translate->seq(), "\n";    # print the translation to protein
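To see what a reverse complement is without installing BioPerl, here is a plain-Perl sketch. It is not BioPerl's implementation; the subroutine name reverse_complement is hypothetical, and it only handles the four unambiguous DNA letters.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base in place
    return $rc;
}

print reverse_complement('ATGGGGGTGGTGGTACCCT'), "\n";
```

Unlike the revcom method, this sketch leaves ambiguity codes such as N or R untouched.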

25 BioPerl Modules
sequence objects
    Bio::Seq
    Bio::PrimarySeq
    Bio::LiveSeq
alignment objects
    Bio::SimpleAlign
    Bio::UnivAln
IO and DB objects

26 Sequence manipulation

27 Sequence Objects
Bio::Seq: default general purpose sequence representation
Bio::PrimarySeq: stripped down version of Seq
Bio::LargeSeq: genomic-sized (>100MB) sequences
Bio::LiveSeq: Seq whose features change over time

28 Structure of Bio::Seq Objects
Methods in common with PrimarySeq:
    seq, length, subseq, display_id, accession_number, desc, primary_id, moltype, revcom, trunc
Seq only methods:
    primary_seq, annotation, add_seqfeature, top_seqfeatures, all_seqfeatures, feature_count, species

29 Creating a Bio::Seq Object

    $seq = Bio::Seq->new(
        -seq              => 'ATCGT',
        -desc             => 'Sample sequence',
        -display_id       => 'something',
        -accession_number => 'GB_ID',
        -moltype          => 'dna',
    );

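30 Using the Bio::Seq Object

    $seq = Bio::Seq->new(...);

    ## print as a fasta file
    print ">", $seq->accession_number(), " ", $seq->desc(), "\n";
    for (my $i = 1; $i <= $seq->length(); $i += 70) {
        my $end = $i + 69;
        $end = $seq->length() if $end > $seq->length();
        print $seq->subseq($i, $end), "\n";    # subseq() is 1-based and inclusive
    }

    $protseq = $seq->translate();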
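The line-wrapping logic of a FASTA writer can be sketched in plain Perl. This is an illustration only, assuming a hypothetical subroutine format_fasta that works on strings rather than on a Bio::Seq object; it uses substr, which is 0-based, instead of subseq.

```perl
use strict;
use warnings;

# Build a FASTA record: a ">" header line followed by the sequence
# broken into lines of at most $width characters (default 70).
sub format_fasta {
    my ($id, $desc, $sequence, $width) = @_;
    $width ||= 70;
    my $out = ">$id $desc\n";
    for (my $offset = 0; $offset < length($sequence); $offset += $width) {
        $out .= substr($sequence, $offset, $width) . "\n";
    }
    return $out;
}

# 150 bases wrap into lines of 70, 70 and 10 characters.
print format_fasta('GB_ID', 'Sample sequence', 'ATCGT' x 30);
```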

31 A Little More Translation

    translate(
        stopdef,     # stop char, def = '*'
        unkchar,     # unknown AA, def = 'X'
        frame,       # 0, 1, or 2, def = 0
        code,        # codon table
        fullcds,     # set to true for EMBL and GenBank style
        dieonerr     # set to true to die on translation error
    )

32 Reading Data
SeqIO
    provides a simple way of reading and writing sequences from/to files
    contains filters for most major sequence formats:
        fasta, GenBank, EMBL, SwissProt, PIR, GCG

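33 Using Bio::SeqIO

    $in  = Bio::SeqIO->new(-file => "inputfile",   -format => 'EMBL');
    $out = Bio::SeqIO->new(-file => ">outputfile", -format => 'Fasta');
    while (my $seq = $in->next_seq()) {
        $out->write_seq($seq);
    }

    ## Alternatively, we can use the objects as if they
    ## are filehandles, provided they are created with
    ## newFh() instead of new()
    $in  = Bio::SeqIO->newFh(-file => "inputfile",   -format => 'EMBL');
    $out = Bio::SeqIO->newFh(-file => ">outputfile", -format => 'Fasta');
    while (<$in>) {
        print $out $_;
    }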

34 More on Bio::SeqIO
Note that the return value of next_seq() is a Bio::Seq object. Once we have read in the object, we can work with it just like any other sequence object.

    $in = Bio::SeqIO->new(-file => "inputfile", -format => 'EMBL');
    while (my $seq = $in->next_seq()) {
        print $seq->accession_number(), "\n";
    }
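To make the "one record per call" idea concrete for the simplest format, here is a plain-Perl sketch of a FASTA reader. It is not how Bio::SeqIO is implemented; the subroutine read_fasta and the hash keys id, desc and seq are hypothetical, and the input is an in-memory string so that the example runs without any data file.

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle and return a list of hash
# references with the keys id, desc and seq.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ /^>(\S+)\s*(.*)$/) {      # header line starts a record
            $current = { id => $1, desc => $2, seq => '' };
            push @records, $current;
        }
        elsif ($current) {
            $current->{seq} .= $line;          # sequence may span several lines
        }
    }
    return @records;
}

my $fasta = ">seq1 first example\nATCG\nTTAA\n>seq2 second example\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $record (read_fasta($fh)) {
    print "$record->{id}: ", length($record->{seq}), " bases\n";
}
close $fh;
```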

35 Format conversion - sequences
File format support: Fasta, GenBank, srf, pir, embl, raw, gcg, ace, bsml, game, swiss, phd, fastq
Bio::SeqIO

    # a simple sequence converter, Fasta to EMBL
    use Bio::SeqIO;

    $in  = Bio::SeqIO->new(-file => "inputfilename",   -format => 'Fasta');
    $out = Bio::SeqIO->new(-file => ">outputfilename", -format => 'EMBL');
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }

36 Format Conversion - Alignments
Alignment formats supported:
    INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme
    OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved)
The next_aln() and write_aln() methods of the Bio::AlignIO object are used.

37 Cool Tools
Bio::Tools::SeqStats: molecular weight, residue occurrence counts
Bio::Tools::RestrictionEnzyme: cleaves a sequence
Bio::Tools::OddCodes: hydropathy and charges

38 Obtaining basic sequence statistics: molecular weights, residue & codon frequencies (SeqStats, SeqWord)
Molecular Weight
Monomer Counter
Codon Counter
DNA weights
RNA weights
Amino Weights
More

39 Example

    #!/usr/local/bin/perl
    use Bio::PrimarySeq;
    use Bio::Tools::SeqStats;

    my $seqobj = Bio::PrimarySeq->new(-seq        => 'ATCGTAGCTAGCTGA',
                                      -display_id => 'example1');
    my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

    # count_monomers() returns a reference to a hash of base => count
    my $hash_ref = $seq_stats->count_monomers();
    foreach my $base (sort keys %$hash_ref) {
        print "Number of bases of type ", $base, " = ", $hash_ref->{$base}, "\n";
    }
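Counting monomers is the hash-counting technique from the beginning of the lecture applied to a sequence. The following plain-Perl sketch shows the idea; it is not the SeqStats implementation, and the subroutine name count_bases is hypothetical.

```perl
use strict;
use warnings;

# Count how often each character occurs in a sequence string and
# return a reference to a hash of base => count.
sub count_bases {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for split //, uc $sequence;
    return \%count;
}

my $counts = count_bases('ATCGTAGCTAGCTGA');
foreach my $base (sort keys %$counts) {
    print "Number of bases of type $base = $counts->{$base}\n";
}
```

For the example sequence this reports four A, three C, four G and four T.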

40 Accessing databases

41 Direct Access to Databases
You can create new Bio::Seq objects by importing them directly from a remote database using the Bio::DB family of modules:
    GenBank (Bio::DB::GenBank)
    genpept (Bio::DB::GenPept)
    swissprot (Bio::DB::SwissProt)
    GDB (Bio::DB::GDB)
    ACEDB (Bio::DB::Ace)

42 Fetching from GenBank

    use Bio::Seq;
    use Bio::DB::GenBank;

    $gb = new Bio::DB::GenBank();
    foreach (@ARGV) {
        $seqs{$_} = $gb->get_seq_by_id($_);
    }

This code creates a hash (%seqs) which contains a bunch of Bio::Seq objects, keyed by id (which we conveniently read in from the command line using the @ARGV array).

43 Accessing remote database
BioPerl currently supports sequence data retrieval from the genbank, genpept, RefSeq, swissprot, and EMBL databases.

    $gb   = new Bio::DB::GenBank();
    $seq1 = $gb->get_seq_by_id('musighba1');
    $seq2 = $gb->get_seq_by_acc('af303112');

44 Accessing local database
Index: indexing local sequence data files with Bio::Index.
    Supported formats: genbank, swissprot, pfam, embl and fasta.
Retrieve

45 Structures: Reading PDB files
PDB: a database containing protein structures.
The main object is a StructureIO object.
It allows access to a variety of related Bio::Structure objects using a hierarchy (shown in the next slide).

46 Reading PDB files
Hierarchy in StructureIO object:
    Entry => Models => Chains => Residues => Atoms

47 Reading PDB files Among other functionality: XYZ coordinates of atom can be extracted into an array Subsequences can be extracted

48 Reading PDB files Core Code $in_structio = Bio::Structure::IO->new(-file => "1cbx.pdb", '-format' => 'pdb'); $struct_id = = = = = $struct->get_atoms($residues[0]);

49 Sequence Similarity Tools in BioPerl

50 Smith Waterman Search
Smith Waterman pairwise alignment
    Standard method for producing an optimal local alignment of two sequences
    Auxiliary bioperl-ext library required
    SW algorithm implemented in C and incorporated into BioPerl
    align_and_show() and pairwise_alignment() in the Bio::Tools::pSW module are the methods used

51 Smith Waterman Search: Core Code

    use Bio::Tools::pSW;
    use Bio::SeqIO;
    use Bio::AlignIO;

    $factory = new Bio::Tools::pSW('-matrix' => 'BLOSUM62',
                                   '-gap'    => 12,
                                   '-ext'    => 2);
    $aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]);
    my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT);
    $alnout->write_aln($aln);

52 Search Tools
Blast output parsers
    Bio::Tools::BPlite: regular BLAST
    Bio::Tools::BPpsilite: PSI-BLAST
HMM parser
    Bio::Tools::HMMER::Results: hmmsearch, hmmpfam

53 Example BPlite

    use Bio::Seq;
    use Bio::Tools::BPlite;

    $resfile = shift;
    open(my $fh, '<', $resfile) or die "Cannot open $resfile: $!";
    $rep = Bio::Tools::BPlite->new(-fh => $fh);    # -fh expects an open filehandle
    $rep->query;
    while (my $hit = $rep->nextSbjct()) {
        $hit->name;
        while (my $hsp = $hit->nextHSP()) {
            $hsp->score();
        }
    }

This code iterates through the BLAST report, finding the scores of the HSPs for all reported hits.

54 Remote Execution of BLAST
BioPerl has a built-in capability for running BLAST jobs remotely using RemoteBlast.pm.
Runs these jobs at NCBI automatically
    NCBI has dynamic configurations (server side) to always be up and ready
    Automatically updated for new BioPerl releases

55 Example of Remote Blast
A script to run a remote blast would be something like the following skeleton. In this example we are running a blastp (protein-protein comparison) using the nr (nonredundant) database and an e-value threshold of 1e-10. The sequences that are being compared are located in the file d:/data/unknown.fa.

    $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog'   => 'blastp',
        '-data'   => 'nr',
        '-expect' => '1e-10',
    );
    $r = $remote_blast->submit_blast("d:\\data\\unknown.fa");
    while (@rids = $remote_blast->each_rid) {
        foreach $rid (@rids) {
            $rc = $remote_blast->retrieve_blast($rid);
        }
    }

56 Parsing BLAST and FASTA Reports
From the report, overall attributes (e.g. the query) and "hits" can be accessed.
Individual high-scoring segment pairs for each hit can then be accessed.
Main BioPerl objects in 1.2 are Search.pm/SearchIO.pm
    SearchIO is more robust and the preferred choice (it will continue to be supported in future releases)
Older parsers: BPlite, BPpsilite, and BPbl2seq

57 Sample Script to Read and Parse BLAST Report

    # Get the report
    $searchio = new Bio::SearchIO(-format => 'blast',
                                  -file   => $blast_report);
    $result = $searchio->next_result;

    # Get info about the entire report
    $result->database_name;
    $algorithm_type = $result->algorithm;

    # get info about the first hit
    $hit = $result->next_hit;
    $hit_name = $hit->name;

    # get info about the first hsp of the first hit
    $hsp = $hit->next_hsp;
    $hsp_start = $hsp->query->start;

58 Running BLAST Locally
StandAloneBlast: Bio::Tools::Run::StandAloneBlast
Factory:

    @params = ('program'  => 'blastn',
               'database' => 'ecoli.nt');
    $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

59 Examples

    # Setting parameters similar to RemoteBlast
    $input = Bio::Seq->new(-id  => "test query",
                           -seq => "ACTAAGTGGGGG");
    $blast_report = $factory->blastall($input);

    # Blast Report Object that directly accesses parser
    while (my $sbjct = $blast_report->next_hit) {
        while (my $hsp = $sbjct->next_hsp) {
            print $hsp->score . " " . $hsp->subject->seqname . "\n";
        }
    }

60 ClustalW Using BioPerl

61 ClustalW & ClustalX
Multiple sequence alignment
Generates pairwise alignments of all input sequences, then builds a phylogenetic tree to determine the order in which the alignment is constructed.
ClustalX: graphical interface to ClustalW.
On Unix: module load soft/clustalx

62 ClustalW and Profile Align
ClustalW using BioPerl
    The clustalw program should be installed and the environment variable CLUSTALDIR set
Setting parameters
    Build a factory
    Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet'
The align() and profile_align() methods are used

63 ClustalW Core Code Example

    use Bio::SeqIO;
    use Bio::Tools::Run::Alignment::Clustalw;

    @params = ('ktuple'  => 2,
               'matrix'  => 'BLOSUM',
               'outfile' => 'clustalw_out',
               'quiet'   => 1);
    $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
    $seq_array_ref = \@seq_array;
    $aln = $factory->align($seq_array_ref);

64 Profile Align (ClustalW)
Profile aligning is also possible:
    between 2 profiles
    between an alignment and an unaligned sequence
Core code (alignment and unaligned seq):

    $prof_aln = $factory->profile_align($aln, $seq);

More info: ClustalW manpage
Use of TCoffee is very similar to this

65 Conclusion
Other features: restriction enzymes, motifs, exon and gene prediction, annotation, phylogenetic trees, bibliography, graphics, generic genome browser, etc.
Before you start to write your own code, check out the existing modules.
When the documentation is not helpful, check out the examples.
