MetaPhyler Usage Manual

Similar documents
AMPHORA2 User Manual. An Automated Phylogenomic Inference Pipeline for Bacterial and Archaeal Sequences. COPYRIGHT 2011 by Martin Wu

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Database Searching Using BLAST

Tutorial 4 BLAST Searching the CHO Genome

BLAST, Profile, and PSI-BLAST

Sequence Alignment. GBIO0002 Archana Bhardwaj University of Liege

INTRODUCTION TO BIOINFORMATICS

Bioinformatics explained: BLAST. March 8, 2007

Basic Local Alignment Search Tool (BLAST)

INTRODUCTION TO BIOINFORMATICS

Environmental Sample Classification E.S.C., Josh Katz and Kurt Zimmer

2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Introduction to Phylogenetics Week 2. Databases and Sequence Formats

Pairwise Sequence Alignment. Zhongming Zhao, PhD

Taxonomic classification of SSU rrna community sequence data using CREST

Install and run external command line softwares. Yanbin Yin

Sequence Alignment: BLAST

Lecture 5 Advanced BLAST

How to Run NCBI BLAST on zcluster at GACRC

diamond v February 15, 2018

Similarity Searches on Sequence Databases

RAMMCAP The Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline

Sequence alignment theory and applications Session 3: BLAST algorithm

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

BIR pipeline steps and subsequent output files description STEP 1: BLAST search

Homology Modeling FABP

Kraken: ultrafast metagenomic sequence classification using exact alignments

BLAST - Basic Local Alignment Search Tool

QIIME and the art of fungal community analysis. Greg Caporaso

This tutorial will show you how to conduct a BLAST search. With BLAST you may:

HORIZONTAL GENE TRANSFER DETECTION

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

HymenopteraMine Documentation

Assessing Transcriptome Assembly

Genomic Island Hunter (GIHunter)

The BLASTER suite Documentation

Alignments BLAST, BLAT

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

PART 1: GENOME BROWSING WITH ARTEMIS

BovineMine Documentation

MEGAN5 tutorial, September 2014, Daniel Huson

SeqGrapheR. Petr Novak May 10, 2010

OrthoMCL v1.4. Recall: Web Service: Datadoc v.1 1/29/ Algorithm Description (SCIENCE)

Finding data. HMMER Answer key

Daniel H. Huson. September 11, Contents 1. 1 Introduction 3. 2 Getting Started 5. 4 Program Overview 6. 6 The NCBI Taxonomy 9.

User Manual for MEGAN V6.10.6

BLAST MCDB 187. Friday, February 8, 13

Manual of mirdeepfinder for EST or GSS

Seminar III: R/Bioconductor

SeqGrapheR Manual. Petr Novak September 4, Introduction 1. 3 Input files Project 10

CS313 Exercise 4 Cover Page Fall 2017

Daniel H. Huson and Stephan C. Schuster with contributions from Alexander F. Auch, Daniel C. Richter, Suparna Mitra and Qi Ji.

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CLC Server. End User USER MANUAL

MetAmp: a tool for Meta-Amplicon analysis User Manual

ASAP - Allele-specific alignment pipeline

When you use the EzTaxon server for your study, please cite the following article:

2018/08/16 14:47 1/36 CD-HIT User's Guide

visualize and recover Grapegen Affymetrix Genechip Probeset Initial page: Optimized for Mozilla Firefox 3 (recommended browser)

CrocoBLAST: Running BLAST Efficiently in the Age of Next-Generation Sequencing

BLAST Exercise 2: Using mrna and EST Evidence in Annotation Adapted by W. Leung and SCR Elgin from Annotation Using mrna and ESTs by Dr. J.

2 Algorithm. Algorithms for CD-HIT were described in three papers published in Bioinformatics.

MetaCom Sample Data and Tutorial

Introduction to Bioinformatics Online Course: IBT

Fast-track to Gene Annotation and Genome Analysis

Mercury BLASTN Biosequence Similarity Search System: Technical Reference Guide

Data analysis of 16S rrna amplicons Computational Metagenomics Workshop University of Mauritius

What do I do if my blast searches seem to have all the top hits from the same genus or species?

BLAST & Genome assembly

Uploading sequences to GenBank

Introduc)on to annota)on with Artemis. Download presenta.on and data

Expressed sequence tags: analysis and annotation

The Design, Implementation, and Evaluation of

DNA / RNA sequencing

ESG: Extended Similarity Group Job Submission

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

Introduction to Computational Molecular Biology

Phylogeny Yun Gyeong, Lee ( )

Mapping NGS reads for genomics studies

1 Abstract. 2 Introduction. 3 Requirements

VectorBase Web Apollo April Web Apollo 1

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP. Exercise 2: Predicting Protein-encoding Genes, BlastMatrix, BlastAtlas

How to use KAIKObase Version 3.1.0

Heuristic methods for pairwise alignment:

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

Data Walkthrough: Background

biokepler: A Comprehensive Bioinforma2cs Scien2fic Workflow Module for Distributed Analysis of Large- Scale Biological Data

Tutorial: chloroplast genomes

Metagenome Processing and Analysis

Intro to NGS Tutorial

Managing big biological sequence data with Biostrings and DECIPHER. Erik Wright University of Wisconsin-Madison

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

diamond Requirements Time Torque/PBS Examples Diamond with single query (simple)

MetaStorm: User Manual

Biology 644: Bioinformatics

Blast2GO User Manual. Blast2GO Ortholog Group Annotation May, BioBam Bioinformatics S.L. Valencia, Spain

Transcription:

MetaPhyler Usage Manual Bo Liu boliu@umiacs.umd.edu March 13, 2012 Contents 1 What is MetaPhyler 1 2 Installation 1 3 Quick Start 2 3.1 Taxonomic profiling for metagenomic sequences.............. 2 3.2 Build your own classifier, and perform classification............ 2 3.3 Which Metaphyler mode: blastn or blastx?................ 3 3.4 Run Metaphyler pipeline step by step................... 3 4 Program Usage 4 4.1 simureads................................... 4 4.2 metaphylertrain............................... 4 4.3 metaphylerclassify.............................. 4 4.4 combine.................................... 5 4.5 taxprof..................................... 5 5 Citation 5 1 What is MetaPhyler MetaPhyler is a taxonomic classifier for metagenomic (or anonymous) DNA or protein sequences. Given a set of reference sequences and their taxonomic labels, MetaPhyler can train itself using these information, and classify anonymous query sequences into the taxonomic labels. We have pre-trained MetaPhyler on a set of phylogenetic marker genes, and these pre-trained models can be used to classify metagenomic sequences and estimate the taxonomic profile. 2 Installation Requirements: g++; Perl; NCBI BLAST. Installation has been tested on RHEL Server r5.5, with gcc v4.5.2 and perl v5.8.8. Uncompress the software package: tar xzvf package-name Then enter into the directory and install Metaphyler:./installMetaphyler.pl 1

3 Quick Start Here we show how to perform common tasks using test datasets in folder test : (1) test.ref.dna: reference DNA sequences in FASTA format. (2) test.ref.protein: reference protein sequences in FASTA format. Same as in (1). (3) test.ref.taxonomy: taxonomy labels for reference sequences. (4) test.query.dna: DNA sequences to be classified. 3.1 Taxonomic profiling for metagenomic sequences Here, we have a set of metagenomic DNA sequences, which could be reads or predicted ORFs, and you want to know the taxonomic composition. Metaphyler tries to identify the phylogenetic marker genes present in the sample, classify them and compute the taxonomy profile. The classifiers for these phylogenetic marker genes have already been trained. Suppose your reads are in file test.query.dna in folder test; outputs are stored in files with prefix test.blastn or test.blastx ; BLAST was run with 1 thread. For Illumina reads with average length 100bp, we recommend:./runmetaphyler.pl./test/test.query.dna blastn test.blastn 1 For longer sequences (e.g., 454 or Sanger), we recommend:./runmetaphyler.pl./test/test.query.dna blastx test.blastx 1 Best performance can be achieved by running both blastn and blastx models, and combine the classifications:./combine test.blastn.classification test.blastx.classification 3.2 Build your own classifier, and perform classification Suppose test.query.dna contains DNA sequences with length range from 200bp to 400bp. Then we build classifier by simulating reads from test.ref.dna, and align them to test.ref.protein using blastx. The following command specifies that we simulate 200bp, 300bp and 400bp DNA fragments; outputs are stored in files with prefix test ; run blastx with single thread../buildmetaphyler.pl norm./test/test.ref.dna./test/test.ref.protein 200,300,400./test/test.ref.taxonomy blastx test 1 Run classification:./runclassifier.pl./test/test.query.dna./test/test.ref.protein./test/test.ref.taxonomy test.blastx.classifier blastx test 1 Suppose you only have the DNA sequences test.query.dna, then the training and classification can only be performed with blastn:./buildmetaphyler.pl norm./test/test.ref.dna./test/test.ref.dna 200,300,400./test/test.ref.taxonomy blastn test 1./runClassifier.pl./test/test.query.dna./test/test.ref.dna./test/test.ref.taxonomy test.blastn.classifier blastn test 1 Usually blastn works better for short reads (e.g., 100bp Illumina reads); blastx works better for sequences longer than 100bp. But you are welcome to try both approaches on you dataset, and combine the classification results. 2

3.3 Which Metaphyler mode: blastn or blastx? Our suggestion is that try both, and then merge the classification results: for each query read, pick the classification that has higher confidence score. In addition, we have some empirical guidlines: (1) for short reads (e.g., 100bp Illumina reads), use blastn model; otherwise, use blastx. (2) for finding remote homologies, and classify query sequences that are not very close to your reference genes, use blastx. (3) if your reference sequences are not protein genes (e.g., 16S rrna), which means they do not have amino acid sequences, then use blastn. (4) if your sequences have many deletions or insertions (e.g., 10 insertions or deletions in a 300bp read), then use blastn. 3.4 Run Metaphyler pipeline step by step If you want to have full control of the program, e.g., BLAST usually takes long time to run, and you want to parallelize it in your computer clusters, in the following we show how to run the whole pipeline step by step. (1) Simulate reads (300bp, 30bp step size):./simureads 300 30./test/test.ref.dna > test.300.30.dna (2) Compile blast database: formatdb -p T -i./test/test.ref.protein (3) Run blastx: blastall -p blastx -m8 -b1000 -e1e-3 -i test.300.30.dna -d./test/test.ref.protein > test.300.30.blastx You can parallelize blast in various ways. Also you can tune evalue parameter depending on your sequence length. (4) Compute classifier:./metaphylertrain norm./test/test.ref.taxonomy./test/test.ref.protein test.300.30.blastx 300 blastx > test.300.blastx.classifier You can repeat above steps (1, 3 and 4) for other lengths, e.g., 200 and 400, and put all classifiers into one file, say test.blastx.classifier. (5) Align query sequences to references: blastall -p blastx -m8 -b1 -e1e-2 -i./test/test.query.dna -d./test/test.ref.protein > test.query.blastx Note that you only need 1 blast hit for each query sequence, so use option -b1. You can adjust the evalue parameter depending your sequence length, and parallelize blast in various ways. (6) Classify query reads:./metaphylerclassify test.blastx.classifier./test/test.ref.taxonomy test.query.blastx > test.classification Note that classifiers based on phylogenetic marker genes for taxonomic profiling have been computed: markers.blastn.classifier, markers.blastx.classifier and markers.taxonomy in folder markers. In this case, you only need to run step 5 and 6 for your own metagenomic sequences. 3

4 Program Usage 4.1 simureads Step size 30bp is recommended. The smaller the step size, the more accurate. But at the same time, it takes longer time to train the classifier in the next few steps../simureads <length> <step size> <FASTA file> <length> <step size> length of reads to be simulated. distance between two simulated reads. 4.2 metaphylertrain If the the taxonomy is well defined, and there are sampling biases between the clusters in the taxonomy, then normalization is recommended. For example, in the NCBI taxonomy, some species (Escherichia coli) have much more genomes sequenced that others do. Option norm is used to normalize against these biases../metaphylertrain <norm unnorm> <taxonomy> <ref seq> <BLAST file> <length> <BLAST> <norm unnorm> <taxonomy> <ref seq> <BLAST file> <length> <BLAST> Perform normalization (norm) or not (unnorm). Taxonomy labels of reference sequences in the BLAST file. Reference sequence FASTA file. BLAST alignment between simulated reads and reference sequences. All simulated reads come from reference sequences, and are named as follows: If n reads come from A, then their IDs are A_0, A_1,..., A_n-1. Length of simulated reads. BLASTN, BLASTP, BLASTX or TBLASTX. 4.3 metaphylerclassify To classify the reads in the blast file, you may have trained several classifiers through simulation for different sequence lengths. To use all these classifiers, you can simply concatenate them into one file, or you can list them one by one in the program arguments../metaphylerclassify <classifiers> <taxonomy file> <BLAST file> <scores file> Output from program blast2taxscores. If there are multiple files, separate them with comma (e.g., filea,fileb). You can also merge them into one file. <taxonomy file> Taxonomy labels of reference sequences in the BLAST file. <BLAST file> BLAST alignment between query reads and reference sequences. 4

4.4 combine For a set of query reads, you may have several classifications. For example, one from blastn classifiers and one from blastx classifiers. To combine them, simply list the files as the argument of this program../combine <classification 1> <classification 2>... <classification> Result file from program metaphylerclassify. 4.5 taxprof Metaphyler has already trained classifiers for phylogenetic marker genes. After classification of metagenomic sequences, we want to know the taxonomy profile. This program summarizes the classification information, and reports the taxonomy profiles at different taxonomic levels../taxprof <conf. cutoff> <classification> <prefix> <taxonomy names> <conf. cutoff> Cutoff for confidence score. Recommendation: 0.9. <classification> Result file from program metaphylerclassify. <prefix> Output files prefix. <taxonomy names> File: 1st column, taxonomy ID; 2nd, name. If omitted, output will just use taxonomy IDs. Output files: prefix.<genus family order class phylum>.taxprof. Taxonomy profiles at each level. 5 Citation Please cite the following paper as the reference to this program. Liu B., Gibbons T, Ghodsi M, Treangen T and Pop M(2010). Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011, 12(Suppl 2):S4. 5