Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Similar documents
Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Rules of Thumb. 1/25/05 CAP5510/CGS5166 (Lec 5) 1

Perl: Examples. # Storing DNA in a variable, and printing it out

How can we use hashes to count?

PERL NOTES: GETTING STARTED Howard Ross School of Biological Sciences

Unix, Perl and BioPerl

Perl for Biologists. Object Oriented Programming and BioPERL. Session 10 May 14, Jaroslaw Pillardy

Sequence analysis with Perl Modules and BioPerl. Unix, Perl and BioPerl. Regular expressions. Objectives. Some uses of regular expressions

What is bioperl. What Bioperl can do

1. HPC & I/O 2. BioPerl

BioPerl Tutorial I. Introduction I.1 Overview I.2 Software requirements I.2.1 Minimal bioperl installation I.2.2 Complete installation I.

Scientific Programming Practical 10

BioPerlTutorial - a tutorial for bioperl

BioPerl. General capabilities (packages)

Sequence Alignment. GBIO0002 Archana Bhardwaj University of Liege

Automating Data Analysis with PERL

Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 -

Tutorial 4 BLAST Searching the CHO Genome

CSB472H1: Computational Genomics and Bioinformatics

Lezione 7. Bioinformatica. Mauro Ceccanti e Alberto Paoluzzi

VERY SHORT INTRODUCTION TO UNIX

Managing Your Biological Data with Python

Genome Browsers - The UCSC Genome Browser

INTRODUCTION TO BIOINFORMATICS

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

INTRODUCTION TO BIOINFORMATICS

Database Searching Using BLAST

Chromatin immunoprecipitation sequencing (ChIP-Seq) on the SOLiD system Nature Methods 6, (2009)

Using Biopython for Laboratory Analysis Pipelines

Beginning Perl for Bioinformatics. Steven Nevers Bioinformatics Research Group Brigham Young University

Genome 559 Intro to Statistical and Computational Genomics. Lecture 17b: Biopython Larry Ruzzo

Eval: A Gene Set Comparison System

Topics of the talk. Biodatabases. Data types. Some sequence terminology...

BIOS 546 Midterm March 26, Write the line of code that all Perl programs on biolinx must start with so they can be executed.

Lezione 7. BioPython. Contents. BioPython Installing and exploration Tutorial. Bioinformatica. Mauro Ceccanti e Alberto Paoluzzi

Programming Languages and Uses in Bioinformatics

from scratch A primer for scientists working with Next-Generation- Sequencing data CHAPTER 8 biopython

Biopython. Karin Lagesen.

Guide to Programming with Python. Algorithms & Computer programs. Hello World

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

Introduction to Biopython

Basic Local Alignment Search Tool (BLAST)

Tutorial: chloroplast genomes

PERL Bioinformatics. Nicholas E. Navin, Ph.D. Department of Genetics Department of Bioinformatics. TA: Dr. Yong Wang

Lecture 5 Advanced BLAST

Outline. Introduction to Perl. Why use scripting languages? What is expressiveness. Why use Java over C

Welcome to. Python 2. Session #5. Michael Purcaro, Chris MacKay, Nick Hathaway, and the GSBS Bootstrappers February 2014

BIFS 617 Dr. Alkharouf. Topics. Parsing GenBank Files. More regular expression modifiers. /m /s

Long Read RNA-seq Mapper

PERL Scripting - Course Contents

Indian Institute of Technology Kharagpur. PERL Part II. Prof. Indranil Sen Gupta Dept. of Computer Science & Engg. I.I.T.

Biopython: Python tools for computation biology

Patterns / Regular expressions

Circ-Seq User Guide. A comprehensive bioinformatics workflow for circular RNA detection from transcriptome sequencing data

COMS 3101 Programming Languages: Perl. Lecture 2

Advanced UCSC Browser Functions

Important Example: Gene Sequence Matching. Corrigiendum. Central Dogma of Modern Biology. Genetics. How Nucleotides code for Amino Acids

BLAST, Profile, and PSI-BLAST

MetaPhyler Usage Manual

Genomic Analysis with Genome Browsers.

(DNA#): Molecular Biology Computation Language Proposal

Using many concepts related to bioinformatics, an application was created to

Heuristic methods for pairwise alignment:

Unix, Perl and BioPerl

Genome Browsers Guide

Tutorial 1: Exploring the UCSC Genome Browser

ChIP-Seq Tutorial on Galaxy

Chen lab workshop. Christian Frech

Perl Scripting. Students Will Learn. Course Description. Duration: 4 Days. Price: $2295

CSCI 4152/6509 Natural Language Processing. Perl Tutorial CSCI 4152/6509. CSCI 4152/6509, Perl Tutorial 1

BMMB 597D - Practical Data Analysis for Life Scientists. Week 12 -Lecture 23. István Albert Huck Institutes for the Life Sciences

Min Wang. April, 2003

2 Algorithm. Algorithms for CD-HIT were described in three papers published in Bioinformatics.

Tutorial: How to use the Wheat TILLING database

Biostatistics and Bioinformatics Molecular Sequence Databases

Miniproject 1. Part 1 Due: 16 February. The coverage problem. Method. Why it is hard. Data. Task1

ChIP-seq Analysis Practical

Sequence Analysis with Perl. Unix, Perl and BioPerl. Why Perl? Objectives. A first Perl program. Perl Input/Output. II: Sequence Analysis with Perl

Exercise 2: Browser-Based Annotation and RNA-Seq Data

2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.

Motif Discovery using optimized Suffix Tries

Multiple Sequence Alignments

CLC Server. End User USER MANUAL

Input files: Trim reads: Create bwa index: Align trimmed reads: Convert sam to bam: Sort bam: Remove duplicates: Index sorted, no-duplicates bam:

Lecture 2, Introduction to Python. Python Programming Language

Introduction to Unix/Linux INX_S17, Day 8,

CAP BIOINFORMATICS Su-Shing Chen CISE. 8/19/2005 Su-Shing Chen, CISE 1

MSCBIO 2070/02-710: Computational Genomics, Spring A4: spline, HMM, clustering, time-series data analysis, RNA-folding

Public Repositories Tutorial: Bulk Downloads

Boolean Network Modeling

SMART SEQUENCE SIMILARITY SEARCH (S 4 ) SYSTEM. A Project. Presented to the. Faculty of. California State University, San Bernardino

Introduction to Perl. Perl Background. Sept 24, 2007 Class Meeting 6

Tiling Assembly for Annotation-independent Novel Gene Discovery

Genome 559 Intro to Statistical and Computational Genomics Lecture 18b: Biopython Larry Ruzzo (Thanks again to Mary Kuhner for many slides)

Introduction to BLAST with Protein Sequences. Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 6.2

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

RNA-seq. Manpreet S. Katari

How to submit nucleotide sequence data to the EMBL Data Library: Information for Authors

BLAST Exercise 2: Using mrna and EST Evidence in Annotation Adapted by W. Leung and SCR Elgin from Annotation Using mrna and ESTs by Dr. J.

TBtools, a Toolkit for Biologists integrating various HTS-data

Transcription:

CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 1/18/07 CAP5510 1

Molecular Biology Background 1/18/07 CAP5510 2

Central Dogma DNA Gene1 Gene2 Gene3 TFs activator repressor mrna Transcription Translation Protein 1/18/07 CAP5510 3

Enterobacteria phage lambda ORIGIN 5' end of the ecori-g fragment, polarity of l strand. 1 aattcttttg ctttttaccc tggaagaaat actcataagc cacctctgtt atttaccccc 61 aatcttcaca agaaaaactg tatttgacaa acaagataca ttgtatgaaa atacaagaaa 121 gtttgttgat ggaggcgata tgcaaactct ttctgaacgc ctcaagaaga ggcgaattgc 181 gttaaaaatg acgcaaaccg aactggcaac caaagccggt gttaaacagc aatcaattca 241 actgattgaa gctggagtaa ccaagcgacc gcgcttcttg tttgagattg ctatggcgct 301 taactgtgat ccggtttggt tacagtacgg aactaaacgc ggtaaagccg cttaagacat 361 tcccgctctt acacattcca gccctgaaaa agggcatcaa attaaaccac acctatggtg 421 tatgcattta tttgcataca ttcaatcaat tgttatctaa ggaaatactt acatatggtt 481 cgtgcaaaca aacgcaacga ggctctacga atcgagagtg cgttgcttaa caaaatcgca 1141 aatggtgcat ccctcaaaac gagggaaaat cccctaaaac gagggataaa acatccctca 1201 aattggggga ttgctatccc tcaaaacagg gggacacaaa agacactatt acaaaagaaa 1261 aaagaaaaga ttattcgtca gagaatt // DNA Gene for protein Cro 140-355 mrna a ugcaaacucu uucugaacgc cucaagaaga ggcgaauugc guuaaaaaug acgcaaaccg aacuggcaac caaagccggu guuaaacagc aaucaauuca acugauugaa gcuggaguaa ccaagcgacc gcgcuucuug uuugagauug cuauggcgcu uaacugugau ccgguuuggu uacaguacgg aacuaaacgc gguaaagccg cuuaa ORIGIN // 1 mqtlserlkk rrialkmtqt elatkagvkq qsiqlieagv tkrprflfei amalncdpvw 61 lqygtkrgka a Protein 1/18/07 CAP5510 4

The Genetic Code 1/18/07 CAP5510 5

Enterobacteria phage lambda mrna a ugcaaacucu uucugaacgc cucaagaaga ggcgaauugc guuaaaaaug acgcaaaccg aacuggcaac caaagccggu guuaaacagc aaucaauuca acugauugaa gcuggaguaa ccaagcgacc gcgcuucuug uuugagauug cuauggcgcu uaacugugau ccgguuuggu uacaguacgg aacuaaacgc gguaaagccg cuuaa Protein 1 mqtlserlkk rrialkmtqt elatkagvkq qsiqlieagv tkrprflfei amalncdpvw 61 lqygtkrgka a 1/18/07 CAP5510 6

DNA Transcription 1/18/07 CAP5510 7

Transcription Regulation CAT Box CCAAT Basal TF Binding Sites TATA Box TATAAT Transcription Start point Gene Exon 1 Exon 2 Intron enhancer Gene-Specific TF Binding Sites promoter region coding region 1/18/07 CAP5510 8

Transcription Initiation 1/18/07 CAP5510 9

Transcription 1/18/07 CAP5510 10

Transcription Steps RNA polymerase needs many transcription factors (TFIIA,TFIIB, etc.) (A) The promoter sequence (TATA box) is located 25 nucleotides away from transcription initiation site. (B) The TATA box is recognized and bound by transcription factor TFIID, which then enables the adjacent binding of TFIIB. DNA is somewhat distorted in the process. (D) The rest of the general transcription factors as well as the RNA polymerase itself assemble at the promoter. What order? (E) TFIIH then uses ATP to phosphorylate RNA polymerase II, changing its conformation so that the polymerase is released from the complex and is able to start transcribing. As shown, the site of phosphorylation is a long polypeptide tail that extends from the polymerase molecule. 1/18/07 CAP5510 11

Transcription Factors The general transcription factors have been highly conserved in evolution; some of those from human cells can be replaced in biochemical experiments by the corresponding factors from simple yeasts. 1/18/07 CAP5510 12

Protein Synthesis: Incorporation of amino acid into protein 1/18/07 CAP5510 13

1/18/07 CAP5510 14

Perl: Practical Extraction & Report Language Created by Larry Wall, early 90s Portable, glue language for interfacing C/Fortran code, WWW/CGI, graphics, numerical analysis and much more Easy to use and extensible OOP support, simple databases, simple data structures. From interpreted to compiled high-level features, and relieves you from manual memory management, segmentation faults, bus errors, most portability problems, etc, etc. Competitors: Python, Tcl, Java 1/18/07 CAP5510 15

Perl Features Bit Operations Pattern Matching Subroutines Packages & Modules Objects Interprocess Communication Threads, Process control Compiling 1/18/07 CAP5510 16

Managing a Large Project Devise a common data exchange format. Use modules that have already been developed. Write Perl scripts to convert to and from common data exchange format. Write Perl scripts to glue it all together. 1/18/07 CAP5510 17

What is Bioperl? Tookit of Perl modules useful for bioinformatics Open source; Current version: Bioperl 1.5.2 Routines for handling biosequence and alignment data. Why? Human Genome Project: Same project, same data. different data formats! Different input formats. Different output formats for comparable utility programs. BioPerl was useful to interchange data and meaningfully exchange results. Perl Saved the Human Genome Project Many routine tasks automated using BioPerl. String manipulations (string operations: substring, match, etc.; handling string data: names, annotations, comments, bibliographical references; regular expression operations) Modular: modules in any language 1/18/07 CAP5510 18

Miscellaneous ptk to enable building Perl-driven GUIs for X-Window systems. BioJava BioPython The BioCORBA Project provides an objectoriented, language neutral, platformindependent method for describing and solving bioinformatics problems. 1/18/07 CAP5510 19

Perl: Examples #!/usr/bin/perl -w # Storing DNA in a variable, and printing it out # First we store the DNA in a variable called $DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Next, we print the DNA onto the screen print $DNA; # Finally, we'll specifically tell the program to exit. exit; #perl1.pl 1/18/07 CAP5510 20

Perl: Strings #!/usr/bin/perl -w $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA'; # Concatenate the DNA fragments $DNA3 = "$DNA1$DNA2"; print "Concatenation 1):\n\n$DNA3\n\n"; # An alternative way using the "dot operator": $DNA3 = $DNA1. $DNA2; print "Concatenation 2):\n\n$DNA3\n\n"; # transcribe from DNA to RNA; make rev comp; print; $RNA = $DNA3; $RNA =~ s/t/u/g; $rev = reverse $DNA3; $rev =~ tr/agctacgt/tcgatgca/; print "$RNA\n$rev\n"; exit; #perl2.pl 1/18/07 CAP5510 21

Perl: arrays #!/usr/bin/perl -w # Read filename & remove newline from string $protfile = <STDIN>; chomp $protfile; # First we have to "open" the file unless (open(proteinfile, $protfile) { print "File $protfile does not exist"; exit;} # Each line becomes an element of array @protein @protein = <PROTEINFILE>; print @protein; # Print line #3 and number of lines print $protein[2], "File contained ", scalar @protein, " lines\n"; # Close the file. close PROTEINFILE; exit; #perl3.pl 1/18/07 CAP5510 22

Perl: subroutines #!/usr/bin/perl w # using command line argument $dna1 = $ARGV[0]; $dna2 = $ARGV[1]; # Call subroutine with arguments; result in $dna $dna = addgacagt($dna1, $dna2); print "Add GACAGT to $dna1 & $dna2 to get $dna\n\n"; exit; ##### addacgt: concat $dna1, $dna2, & "ACGT". ##### sub addgacagt { my($dnaa, $dnab) = @_; my($dnac) = $dnaa.$dnab; $dnac.= GACAGT'; return $dnac; } #perl4.pl 1/18/07 CAP5510 23

How Widely is Bioperl Used? 1/18/07 CAP5510 24

What Can You Do with Bioperl? Accessing sequence data from local and remote databases Transforming formats of database/ file records Manipulating individual sequences Searching for similar sequences Creating and manipulating sequence alignments Searching for genes and other structures on genomic DNA Developing machine readable sequence annotations Types of Perl Objects: Sequence objects Location objects Interface objects Implementation objects 1/18/07 CAP5510 25

BioPerl Modules Bio::PreSeq, module for reading, accessing, manipulating, analyzing single sequences. Bio::UnivAln, module for reading, parsing, writing, slicing, and manipulating multiple biosequences (sequence multisets & alignments). Bio::Struct, module for reading, writing, accessing, and manipulating 3D structures. Support for invoking BLAST & other programs. Download URL [http://www.bioperl.org/core/latest/] Tutorial [http://www.bioperl.org/core/latest/bptutorial.html] Course [http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/index. html] 1/18/07 CAP5510 26

BioPerl Sequence Object $seqobj->display_id(); # readable id of sequence $seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # if present, accession num $seqobj->moltype(); # one of 'dna','rna','protein' $seqobj->primary_id(); # unique id for sequence independent # of its display_id or accession number 1/18/07 CAP5510 27

Example 1: Convert SwissProt to fasta format #! /local/bin/perl -w use strict; use Bio::SeqIO; my $in = Bio::SeqIO->newFh ( -file => '<seqs.html', -format => 'swiss' ); my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta', -format => 'fasta' ); print $out $_ while <$in>; exit; #bioperl1.pl 1/18/07 CAP5510 28

Example 2 : Load sequence from remote server #!/local/bin/perl -w use Bio::DB::GenBank; #!/usr/bin/perl -w use Bio::DB::SwissProt; $database = new Bio::DB::SwissProt; $seq = $database->get_seq_by_id('malk_ecoli'); my $out = Bio::SeqIO->newFh(-fh => STDOUT, -format => 'fasta'); print $out $seq; exit; my $gb = new Bio::DB::GenBank( -retrievaltype=>'tempfile', -format=>'fasta'); my ($seq) = $seq = $gb->get_seq_by_id("5802612"); print $seq->id, "\n"; print $seq->desc(), "Sequence: \n"; print $seq->seq(), "\n"; exit; 1/18/07 CAP5510 29

Sequence Formats in BioPerl #! /local/bin/perl -w use strict; use Bio::SeqIO; my $in = Bio::SeqIO->new ( -file => 'seqs.html', -format => 'swiss' ); my $out = Bio::SeqIO->new ( -file => 'seqs.fas', -format => 'fasta' ); while ($seq = $in->next_seq()) { $accnum = $seq->accession_number(); print Accession# = $accnum\n ; $out->write_seq($seq); } exit; #bioperl2.pl 1/18/07 CAP5510 30

BioPerl #!/usr/bin/perl w # define a DNA sequence object with given sequence $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact', '-desc'=>'sample Bio::Seq object', '-display_id' => 'somethingxxx', '-accession_number' => 'accnumxxx', '-alphabet' => 'dna' ); $gb = new Bio::DB::GenBank(); $seq = $gb->get_seq_by_id('musighba1'); #returns Seq object $seq = $gb->get_seq_by_acc('af303112'); #returns Seq object # this returns a SeqIO object : $seqio = $gb->get_stream_by_batch([ qw(j00522 AF303112)])); exit; #bioperl3.pl 1/18/07 CAP5510 31

#!/local/bin/perl -w use Bio::DB::GenBank; $gb = new Bio::DB::GenBank(); Sequence Manipulations $seq1 = $gb->get_seq_by_acc('af303112'); $seq2=$seq1->trunc(1,90); $seq2 = $seq2->revcom(); print $seq2->seq(), "\n"; $seq3=$seq2->translate; print $seq3->seq(), "\n"; exit; #bioperl4.pl 1/18/07 CAP5510 32

use Bio::Structure::IO; BioPerl: Structure $in = Bio::Structure::IO->new(-file => "inputfilename", '- format' => 'pdb'); $out = Bio::Structure::IO->new(-file => ">outputfilename", '- format' => 'pdb'); # note: we quote -format to keep older perl's from complaining. while ( my $struc = $in->next_structure() ) { $out->write_structure($struc); } print "Structure ",$struc->id," number of models: ", scalar $struc->model,"\n"; 1/18/07 CAP5510 33

Sequence Features 1/18/07 CAP5510 34

BioPerl: Seq and SeqIO use Bio::SeqIO; $seqin = Bio::SeqIO->new(-format =>'EMBL', -file=> f1'); $seqout= Bio::SeqIO->new(-format =>'Fasta',-file=>'>f1.fa'); while ((my $seqobj = $seqin->next_seq())) { print "Seq: ", $seqobj->display_id, ", Start of seq ", substr($seqobj->seq,1,10),"\n"; if ( $seqobj->moltype eq 'dna') { $rev = $seqobj->revcom; $id = $seqobj->display_id(); $id = "$id.rev"; $rev->display_id($id); $seqout->write_seq($rev); } #end if foreach $feat ( $seqobj->top_seqfeatures() ) { if( $feat->primary_tag eq 'exon' ) { print STDOUT "Location ",$feat->start,":", $feat->end," GFF[",$feat->gff_string,"]\n";} } # end foreach } # end while exit; #bioperl6.pl 1/18/07 CAP5510 35

Example 3: Read alignment file using AlignIO class #!/usr/bin/perl -w use strict; use Bio::AlignIO; my $in = new Bio::AlignIO(-file => '<data/infile.aln', -format => 'clustalw'); # returns an aligni (alignment interface) my $aln = $in->next_aln(); print "same length of all sequences: ", ($aln->is_flush())? "yes" : "no", "\n"; print "alignment length: ", $aln->length, "\n"; printf "identity: %.2f %%\n", $aln->percentage_identity(); printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity(); 1/18/07 CAP5510 36

Example 4: Standalone BLAST #!/usr/bin/perl -w use strict; use Bio::SeqIO; use Bio::Tools::Run::StandAloneBlast; my $seq_in = Bio::SeqIO -> new(-file => '<data/prot1.fasta', -format => 'fasta'); my $query = $seq_in -> next_seq(); my $factory = Bio::Tools::Run::StandAloneBlast -> new('program' => 'blastp', 'database' => 'swissprot', _READMETHOD => 'Blast'); my $blast_report = $factory->blastall($query); my $result = $blast_report->next_result(); while (my $hit = $result->next_hit()){ print "\thit name: ", $hit->name(), "Significance: ", $hit->significance(), "\n"; while (my $hsp = $hit->next_hsp()){ print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n"; } } exit; 1/18/07 CAP5510 37