INTRODUCTION TO BIOINFORMATICS

Molecular Biology-2017

In this section, we provide a simple introduction to using the web site of the National Center for Biotechnology Information (NCBI) to obtain sequence information.

Link to NCBI web site: http://www.ncbi.nlm.nih.gov/

GENERAL SEARCH

1. The first tool we will explore is the basic search engine. As with Google, you can enter any combination of search terms, or the specific accession number of the sequence of interest, in the search box. You can also specify which database to search from the drop-down menu to the left of the search box.

2. Let's say we are interested in finding information about myosin, a muscle protein. Enter the word myosin in the search box and then click on Search. A new page will be displayed showing the number of records found in each of the different databases.

3. The databases most frequently used in this course are the nucleotide and the protein databases. Click on the nucleotide database to display the records it contains.

4. To refine your search, you can choose the species or the molecule type from the menus on the left, or a specific taxon from the top organisms menu on the right. For this example, we will first choose mRNA from the molecule type menu. Then, from the new window that is displayed, we will choose the records specific to zebrafish (Danio rerio) from the top organisms menu.

5. A list of records corresponding to your search criteria will then be displayed. From there, you can search for and access the specific record of interest. The information that can be obtained from these records is explained further on in this exercise.

6. For your assignment, use this approach to find the protein accession number of the first record for actin, cytoplasmic 2 isoform 1 from the mouse (Mus musculus).

7. Use the general search engine to obtain the record with the accession number NG_009024. Once you've obtained the record, answer the following questions for your assignment: Is this a nucleotide or a protein record? What was the source of the sequence: protein, mRNA, or genomic DNA?
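The keyword search and the refinements in steps 2 to 4 can also be written as a single query string. The sketch below only builds a search URL for NCBI's E-utilities ESearch service and does not contact the server. The helper name build_search_url is hypothetical, and the filter biomol_mrna[PROP] is assumed to correspond to the mRNA choice in the molecule type menu; check the query against the web interface before relying on it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the ESearch service in NCBI's E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(db, keyword, organism=None, molecule_filter=None, retmax=20):
    """Build (but do not send) an ESearch URL for a keyword search.

    build_search_url is a hypothetical helper written for this exercise.
    """
    parts = [keyword]
    if organism:
        # Restrict the search to one organism, like the top organisms menu.
        parts.append(f'"{organism}"[Organism]')
    if molecule_filter:
        # Assumed equivalent of the molecule type menu (e.g. biomol_mrna).
        parts.append(f"{molecule_filter}[PROP]")
    term = " AND ".join(parts)
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Myosin mRNA records from zebrafish in the nucleotide database ("nuccore").
url = build_search_url(
    "nuccore", "myosin", organism="Danio rerio", molecule_filter="biomol_mrna"
)
print(url)

# Decode the URL again to show the search term in readable form.
print(parse_qs(urlparse(url).query)["term"][0])
```

Pasting the decoded search term into the search box of the nucleotide database should give a result list comparable to the one obtained by clicking through the menus.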

SEARCHING WITH A NUCLEOTIDE SEQUENCE

1. The most common search engine used with either nucleotide or protein sequences is the Basic Local Alignment Search Tool (BLAST). You can access this search engine either from the Popular Resources menu on the right or through the Resource List (A-Z) menu on the left.

2. Resource List (A-Z): This page contains most of the links you will be using throughout the year.

3. Let's explore BLAST. Click on the BLAST link to open the BLAST home page. BLAST is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA:

Nucleotide blast (blastn) compares a nucleotide query sequence against a nucleotide sequence database.

Protein blast (blastp) compares an amino acid query sequence against a protein sequence database.

Blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

Tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

Tblastx compares a nucleotide query sequence translated in all reading frames against a nucleotide sequence database dynamically translated in all reading frames.
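The choice among the five programs above depends only on the type of the query, the type of the database, and whether nucleotide sequences should be compared as translated proteins. A minimal sketch of that decision follows; the function name and its arguments are illustrative and are not part of BLAST itself.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Return the BLAST program name for a query/database combination.

    query_type and database_type are "nucleotide" or "protein".
    compare_as_protein only matters for nucleotide against nucleotide:
    it selects tblastx (both sides translated) instead of blastn.
    This helper is a hypothetical illustration, not an NCBI function.
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_as_protein else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    if key not in programs:
        raise ValueError(f"unknown combination: {key}")
    return programs[key]


# An unknown DNA sequence searched against a protein database.
print(choose_blast_program("nucleotide", "protein"))
```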

We will first use this program to gain information on the different sequences that you will be working with. Note that one of these sequences represents the plasmid insert that you must verify in lab exercise 2.

4. Click on the nucleotide BLAST (blastn) option to open the nucleotide BLAST search page.

5. Before we can enter a query sequence, we must make sure that its format is compatible with the program. Most sequence analysis software can handle a format called FASTA. A FASTA file is a plain text file containing one descriptive line of text followed by the sequence itself, without any numbers or other annotation. Here is an example:

>John's sequence123 (Press enter after this line)
AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC

The first line of your file must begin with the symbol ">". This symbol informs the program that this line of text is for descriptive purposes only and that the sequence information starts on the next line. You can write anything on this line to identify the sequence. The next line contains the actual sequence.

6. Obtain the text document of unknown sequences available on the BIO3151 web page by following the link Sequences > Unknown genes. This document contains five sequences numbered 1-5. Convert each of these to FASTA format. You can do this in Notepad.
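The conversion described in step 5 can also be scripted. The sketch below assumes the raw sequence may contain position numbers, spaces and line breaks, all of which are removed; the function name to_fasta and the 60-character line width are choices made for this example only.

```python
import re


def to_fasta(description, raw_sequence, width=60):
    """Return a FASTA-formatted string for one sequence.

    Anything that is not a letter (digits, spaces, line breaks) is removed
    from raw_sequence, and the result is wrapped at `width` characters.
    """
    sequence = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # The description line starts with ">" and the sequence starts on the next line.
    return f">{description}\n" + "\n".join(lines) + "\n"


# A numbered, space-separated sequence as it might appear in a text document.
raw = """
 1 aacgtcggat tcaggtaccc
21 aggaaaacta catctc
"""
print(to_fasta("John's sequence123", raw), end="")
```

The printed text can be pasted directly into the BLAST query box or saved as a plain text file.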

7. Copy and paste the first sequence into the nucleotide BLAST query box. In the Choose Search Set menu, choose the database on which the search will be performed: choose Other and then "nucleotide collection (nr/nt)" from the drop-down menu.

8. Now choose the program that will perform the search from the Program Selection menu. Choose: Somewhat similar sequences (blastn). Check the box "Show results in a new page" to display the results in a new browser window.

9. Click on BLAST. A new page will appear asking you to wait for the completion of your request. This may be quite fast or slow, depending on how heavy the demand on the NCBI server is.

10. Once your request has been completed, a new page will appear showing the results of your search.

11. Before analyzing the results, we will change the formatting options. Click on Formatting options at the top of the page. In the menu that appears, choose the option Old view and then click on Reformat.

12. The potential matches to your sequence will now be presented in three formats: a graphical format at the top; if you scroll down, a textual list of the matching records; and further down, the actual sequence alignments.

For this exercise, the format we are interested in is the list of records representing matches. Among the information that can be obtained are the following values:

Query coverage: This value indicates how much of your input sequence (the original query) aligns with the sequence record found. For instance, if the original query is 631 nucleotides long and BLAST can align all 631 nucleotides of this query against a hit, the coverage is 100%. Remember that query coverage does not take into account the length of the hit, only the percentage of the query that aligns with the hit.

Expect value (E): This parameter describes the number of hits one can expect to see by chance when searching a database of a particular size. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit means that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance. The lower the E value (the closer it is to zero), the more significant the match.
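To make the dependence of the E value on database size concrete, the sketch below uses the standard approximation E = m x n x 2^(-S'), where m is the query length, n is the total length of the database and S' is the bit score of the hit. This is a simplification: BLAST applies further corrections (effective lengths) that are ignored here, and the database size and bit scores used are made-up illustration values, not results from a real search.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n * 2**(-S').

    Simplified form of the statistics used by BLAST; edge-effect
    corrections (effective lengths) are deliberately ignored.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 631                 # query length used in the example above
database_length = 1_000_000_000    # hypothetical database size in nucleotides

for bit_score in (40, 60, 80):     # hypothetical bit scores
    e = expect_value(bit_score, query_length, database_length)
    print(f"bit score {bit_score}: E = {e:.3g}")

# The same hit in a database twice as large has an E value twice as high.
small = expect_value(60, query_length, database_length)
large = expect_value(60, query_length, 2 * database_length)
print(large / small)
```

Under this approximation, each additional bit of score halves the E value, which is why strong matches have E values very close to zero.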

Ident.: BLAST calculates the percentage identity between the query and the hit in a nucleotide-to-nucleotide alignment. How do you explain the fact that more than one sequence has an identity of 100%?

Note that some of the hits are whole genome sequences, for example the first one from this search. For this exercise, you want to obtain the sequence of the gene, not the genome. Such records are sometimes followed by the letter G. Notice in the example above that the record followed by a G shows 100% identity but only 42% coverage. What does that mean?

13. Click on the accession number to view the record. You should obtain a record similar to the one in the figure.

[Figure: example nucleotide record, with a label marking the link to convert to FASTA and numbered labels 1 to 8 marking the fields of the record]
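Query coverage and percentage identity measure different things, which the following sketch illustrates. The function names are hypothetical, and the alignment coordinates are invented so that a 631-nucleotide query aligned over positions 1 to 265 gives roughly the 42% coverage mentioned above; they are not taken from the actual search result.

```python
def query_coverage(query_length, aligned_ranges):
    """Percentage of query positions covered by at least one aligned range.

    aligned_ranges holds (start, end) pairs in query coordinates,
    1-based and inclusive. Overlapping ranges are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(aligned_ranges):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / query_length


def percent_identity(aligned_query, aligned_hit):
    """Percentage of alignment columns in which both sequences are identical.

    Both strings must come from the same alignment and have equal length;
    "-" marks a gap and never counts as a match.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_query, aligned_hit) if a == b and a != "-")
    return 100.0 * matches / len(aligned_query)


# Hypothetical hit: only positions 1 to 265 of a 631-nucleotide query align.
print(round(query_coverage(631, [(1, 265)])))
# Within the aligned region, every column can still be identical.
print(percent_identity("AACGTCGGATTC", "AACGTCGGATTC"))
```

Identity is computed only over the aligned columns, whereas coverage compares the aligned region with the full length of the query.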

14. Information that can be obtained from a nucleotide sequence record:

The definition (#1): Provides a brief description of the sequence; includes information such as the source organism, the gene name or protein name, or some description of the sequence's function.

The accession number (#2): The unique identifier for a sequence record.

Organism (#3): The formal scientific name of the source organism (genus and species).

Source (#4): Information including an abbreviated form of the organism name, sometimes followed by a molecule type.

CDS (#5): Coding sequence; the region of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes the start and stop codons). By clicking on this link, you can obtain the mRNA sequence from the start codon to the stop codon.

o Gene = (#6): The name of the gene.

o Product = (#7): The name of the gene's protein product.

o Protein_id (#8): The protein's accession number. By clicking on this link, you can obtain the protein record.

15. In several of the future exercises, you will be required to obtain and save these sequences in FASTA format. To change the format to FASTA, choose FASTA at the top of the sequence record. You should be redirected to a page showing the record in FASTA format.
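The fields listed in step 14 appear as labelled lines in the text of the record, so they can be picked out with simple pattern matching. The sketch below works on a shortened, entirely made-up record (the accession XX000000, the gene exg1 and the protein_id XXX00000.1 are placeholders, not real entries), and it only handles fields that fit on one line. For real records, a dedicated parser such as the one in Biopython is the more usual choice.

```python
import re

# Shortened, invented record laid out like a GenBank nucleotide record.
SAMPLE_RECORD = """\
DEFINITION  Hypothetical example gene (exg1), mRNA.
ACCESSION   XX000000
SOURCE      Danio rerio (zebrafish)
  ORGANISM  Danio rerio
FEATURES             Location/Qualifiers
     CDS             1..36
                     /gene="exg1"
                     /product="example protein"
                     /protein_id="XXX00000.1"
"""


def extract_fields(record_text):
    """Return the fields from step 14 found in record_text as a dictionary."""
    fields = {}
    # Header fields: a keyword at the start of a line followed by its value.
    for key in ("DEFINITION", "ACCESSION", "SOURCE", "ORGANISM"):
        match = re.search(rf"^\s*{key}\s+(.+)$", record_text, re.MULTILINE)
        if match:
            fields[key.lower()] = match.group(1).strip()
    # Location of the coding sequence in the feature table.
    match = re.search(r"^\s+CDS\s+(\S+)", record_text, re.MULTILINE)
    if match:
        fields["cds"] = match.group(1)
    # Qualifiers written as /name="value".
    for qualifier in ("gene", "product", "protein_id"):
        match = re.search(rf'/{qualifier}="([^"]+)"', record_text)
        if match:
            fields[qualifier] = match.group(1)
    return fields


for name, value in extract_fields(SAMPLE_RECORD).items():
    print(f"{name}: {value}")
```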

16. To save the sequence in this format, select and copy the description line that begins with the symbol ">" together with the sequence, and paste them into a program such as Notepad.

17. For your assignment, obtain the following information for each of the unknown sequences on this course's web site (Sequences > Unknown genes):

Accession number
Coverage
Ident.
E value
The definition
The organism from which this sequence was obtained
The gene name
The gene's product name
The protein's accession number