Using Pipeline Output Data for Whole Genome Alignment

Veronica Walsh

FOR RESEARCH ONLY

Part # , Rev. A
May 2008

Topics

Introduction
    Pipeline
    Maq
    GBrowse
    Hardware Requirements
Workflow
Preparing to Run Maq
    UNIX/Linux Environment
    Testing PERL
    Installing Maq
    Getting Reference Sequences
    Reference Genome with Multiple Chromosomes
Output File from Pipeline
    Required Pipeline Output File
    Format of Sequence.txt File
    Quality Values
Getting Consensus, Identifying SNPs and Indels
    Building Consensus
    Extracting Consensus Information
    SNP Calling
    Indel Discovery
Viewing SNPs and Indels with GBrowse
    GBrowse
    Reformatting Data
    Using GBrowse
Appendix A: Installing Maq Yourself
Appendix B: Quality Value Tables
    Illumina Symbolic ASCII Quality Values
    Sanger Symbolic ASCII Quality Values

This publication and its contents are proprietary to Illumina, Inc., and are intended solely for the contractual use of its customers and for no other purpose than to operate the system described herein. This publication and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.

For the proper operation of this system and/or all parts thereof, the instructions in this guide must be strictly and explicitly followed by experienced personnel. All of the contents of this guide must be fully read and understood prior to operating the system or any of the parts thereof.

FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS GUIDE PRIOR TO OPERATING THIS SYSTEM, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE EQUIPMENT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS OPERATING THE SAME.

Illumina, Inc. does not assume any liability arising out of the application or use of any products, component parts, or software described herein. Illumina, Inc. further does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina, Inc. further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty or fitness is implied, nor does Illumina accept any liability for damages resulting from the information contained in this guide.

Illumina, Inc. All rights reserved.

Illumina, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, iScan, and GenomeStudio are registered trademarks or trademarks of Illumina. All other brands and names contained herein are the property of their respective owners.

Introduction

The Genome Analyzer can generate several Gb of data a week. Converting these huge amounts of sequence data into usable information requires fast and efficient downstream analysis. This document describes how to align Genome Analyzer Pipeline sequence data to a known genome using the Mapping and Assembly with Quality (Maq) application. Results can then be assessed by opening the output files, or imported into a GBrowse implementation to view them in their genomic context.

NOTE: This guide does not explain how to use Pipeline, and provides only limited information on the use of Maq and GBrowse. The main goal is to provide a path to efficiently use Pipeline output for whole genome alignment.

The key sections of this guide are:

- Preparing to Run Maq: gives information on installing Maq.
- Output File from Pipeline: describes the fields in the relevant Pipeline files and the various metrics.
- Getting Consensus, Identifying SNPs and Indels: explains how to get a consensus sequence, SNPs and indels from Maq.
- Viewing SNPs and Indels with GBrowse: explains how to use GBrowse to view SNPs and indels.

Pipeline

The Genome Analyzer Pipeline software is a highly customizable workflow engine capable of taking the raw image data generated by the Genome Analyzer and producing intensity scores, base calls, quality metrics, and quality scored alignments. This software is the result of extensive collaborations with many of the world's leading sequencing centers.

Maq

Maq is a third party open source software tool that builds mapping assemblies from short reads generated by next-generation sequencing machines. Maq is specifically developed for the Genome Analyzer by Heng Li and Richard Durbin from the Sanger Institute. Maq runs on UNIX/Linux, so you will need a computer that uses Linux or UNIX as the operating system.

GBrowse

GBrowse is an open source genome viewer, generated as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.

Hardware Requirements

At minimum, you will need 1 GB of memory. This should be enough to map 2 million reads to a bacterial genome, though 4 GB is preferable. For mammalian-sized genome alignments, you will need to map many batches of about 2 million reads, and you will be better served with 16 GB of memory.
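The batching mentioned under Hardware Requirements can be sketched in Python. This is only a minimal sketch, assuming that every read occupies exactly four lines in the Pipeline sequence file (as in the FASTQ entries described under Output File from Pipeline). The function name split_fastq and the batch file naming are hypothetical and are not part of Pipeline or Maq.

```python
from pathlib import Path


def split_fastq(path, reads_per_batch=2_000_000, out_dir="."):
    """Split a FASTQ file into batches of at most `reads_per_batch` reads.

    Assumes four lines per read. The output names (<stem>.batchN<suffix>)
    are a hypothetical convention chosen for this sketch.
    """
    src = Path(path)
    lines_per_batch = 4 * reads_per_batch
    out_paths = []
    out_handle = None
    with src.open() as handle:
        for line_number, line in enumerate(handle):
            # Start a new batch file every `lines_per_batch` lines.
            if line_number % lines_per_batch == 0:
                if out_handle is not None:
                    out_handle.close()
                batch_number = len(out_paths) + 1
                out = Path(out_dir) / f"{src.stem}.batch{batch_number}{src.suffix}"
                out_paths.append(out)
                out_handle = out.open("w")
            out_handle.write(line)
    if out_handle is not None:
        out_handle.close()
    return out_paths
```

Each batch file can then be converted and aligned separately, which keeps the memory needed per alignment run bounded.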

Workflow

The workflow for generating consensus, SNPs and indels is illustrated in Figure 1.

Figure 1 Workflow: Generating Consensus, SNPs and Indels

Pipeline to Maq to GBrowse

Preparing to Run Maq

Before you can install Maq, there are a number of requirements you need to fulfill. This section lists these requirements, and gives some options for meeting them.

UNIX/Linux Environment

You need to install Maq in an environment that runs on UNIX or Linux (a version of UNIX).

Workstation

Your best option is to run Maq on a dedicated UNIX or Linux workstation. See if you can find such a workstation in your department where you can install and run Maq. You may need to install Linux on a computer from scratch. Talk to your IT department to see what is required, and whether they can help.

Linux Distributions

If you do not have access to a workstation running UNIX/Linux and you need to install Linux, there are many different distributions of Linux available, paid or free. Good choices are Red Hat Linux (paid) and Fedora Linux (free), but others should work too. Use the documentation provided with your Linux distribution for installation.

Testing PERL

Maq uses a number of scripts that are written in the programming language Perl. Many UNIX/Linux distributions already have Perl installed, so first check whether Perl is installed in your UNIX/Linux environment:

1. Go to your UNIX/Linux environment.
2. At the command prompt, enter:

    perl -v

3. Evaluate whether you have Perl installed:
   - If Perl has been installed, you will get a message stating the version of Perl, copyright and other information. Continue with the section Installing Maq.
   - If Perl is not installed yet, you will get a message like this:

    perl: command not found

If Perl is not installed yet, go to and install the most recent fully released version of Perl for Linux and your hardware configuration.

Installing Maq

When your Linux environment is set up, ask your IT department to install Maq. The download is available from maq.sourceforge.net (Figure 2). We used Maq versions and to test the application.
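The checks in this section can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names find_tool and perl_version_banner are hypothetical, and the script does nothing more than look for the perl and maq commands on the search path.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def perl_version_banner():
    """Run `perl -v` and return its output, or None if Perl is not found."""
    if find_tool("perl") is None:
        return None
    result = subprocess.run(["perl", "-v"], capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Report where each prerequisite was found, if at all.
    for tool in ("perl", "maq"):
        location = find_tool(tool)
        print(f"{tool}: {location if location else 'command not found'}")
```

If maq is reported as not found after installation, the install directory is probably not on the search path.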

Figure 2 Maq Home Page (callouts: Maq User's Manual, Maq Reference Manual, Maq FAQ, Download page, Maq Wiki)

NOTE: If you have to install Maq yourself, refer to Appendix A: Installing Maq Yourself.

Getting Reference Sequences

You need to download a reference genome for the organism you sequenced, so that you can compare your reads to it. Many are available from the NCBI website.

1. Open your browser and navigate to the NCBI website.
2. Click on the link Genomic Biology in the left navigation bar.
3. Browse to your species under Genome Projects Database in the right navigation bar.
4. Navigate to or search for the species you are looking for, and click on Project data Genomic.
5. Download the genomic files in FASTA format (*.fasta, *.fa or *.fna). Download each chromosome of your organism.
6. Make sure to keep track of the exact build of the genome you are using. You can find this in the GenBank file, in the Comments section.

NOTE: Another good source for reference genomes is UCSC (hgdownload.cse.ucsc.edu).

Reference Genome with Multiple Chromosomes

If you use a reference genome with multiple chromosomes, you may only find it as one FASTA file per chromosome. You need to combine these FASTA files into one file for the reference genome; otherwise your alignment scores may be affected. Perform the following:

1. Open the command line (Terminal) in Linux.
2. Go to the directory containing the downloaded reference genome files using the cd command.
3. Enter the following:

    cat chr1.fa chr2.fa chr3.fa >ref.fa

where:
- chr1.fa, chr2.fa and chr3.fa are the FASTA input files.
- ref.fa is the FASTA reference genome output file.
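The same concatenation can be sketched in Python, which also makes it easy to check afterwards that every chromosome ended up in the combined file. This is a minimal sketch. The function names concatenate_fasta and sequence_names are hypothetical, and adding a missing final newline is a safeguard introduced here (cat does not do this).

```python
def concatenate_fasta(chromosome_files, reference_file):
    """Concatenate per-chromosome FASTA files into one reference file."""
    with open(reference_file, "wb") as out:
        for name in chromosome_files:
            last_chunk = b"\n"
            with open(name, "rb") as src:
                # Copy in 1 MB chunks so large chromosomes are not held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last_chunk = chunk
            # Guard against a file without a final newline, which would glue
            # the next header line onto the last sequence line.
            if not last_chunk.endswith(b"\n"):
                out.write(b"\n")


def sequence_names(fasta_file):
    """Return the names on the '>' header lines of a FASTA file."""
    names = []
    with open(fasta_file) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names
```

For example, calling concatenate_fasta(["chr1.fa", "chr2.fa", "chr3.fa"], "ref.fa") and then sequence_names("ref.fa") should list one name per input chromosome.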

Output File from Pipeline

After you have called the bases in Pipeline, Pipeline saves files containing the sequence information. This section specifies which file you need from Pipeline for alignment in Maq, and explains the different elements in this file.

Required Pipeline Output File

The Pipeline output file you should use for alignment in Maq has the following naming scheme:

    s_n_r_sequence.txt (for paired-end sequence files)
    s_n_sequence.txt (for single-read sequence files)

where:
- n stands for the lane.
- r stands for the read, in case of paired-end sequencing.

An example file name is s_3_2_sequence.txt; this file contains information from read 2 of lane 3.

Format of Sequence.txt File

The s_n_r_sequence.txt file contains sequence and quality information for one read (read 1 or read 2) of one sequencing lane. The files are in FASTQ format. An example of an entry for one read is shown below:

    @SLXA-B3_604:2:1:512:767/1
    GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT
    +SLXA-B3_604:2:1:512:767/1
    ccccccccccccchkhcchcu`]`lpvrtinksnlaa

Every entry contains the following lines:

Read identifier: The line @SLXA-B3_604:2:1:512:767/1 contains the read identifier, which has the following elements:

    Description                                 Element
    Abbreviated run name                        SLXA-B3_604
    Lane                                        2
    Tile                                        1
    Coordinates of the cluster on the tile      512, 767
    Indicates the read of a paired-end run      /1

The read identifier line starts with an '@', which indicates that this line is followed by a sequence line.

Sequence: The line GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT contains the called sequence for this entry.

Read identifier: The line +SLXA-B3_604:2:1:512:767/1 contains the same read identifier as above, but this time the line starts with a '+', which indicates that it is followed by a quality score line.

Quality scores: The line ccccccccccccchkhcchcu`]`lpvrtinksnlaa contains the quality scores for this entry. Every base call in an entry has a corresponding quality score; that is, the nth position in the quality scores line corresponds to the nth nucleotide in the sequence line.

Quality Values

The quality scores are in Illumina symbolic ASCII format, according to the following formula:

    Quality value = (ASCII character code) - 64

The values of the characters in the Illumina symbolic ASCII format are listed in Appendix B, section Illumina Symbolic ASCII Quality Values. For a single base call, a Q value of 30 is great, Q20 is a good score, and Q10 is still usable.

Difference of Illumina and Phred Scoring Scheme

The Illumina quality scoring scheme and the Phred quality scoring scheme are different:

    Illumina: 10 x log10((1 - e)/e)
    Phred:    -10 x log10(e)

where e is the error probability.

The two definitions round to the same value from approximately Q15 and above; however, Illumina scores can go as low as -5.

Difference of Illumina and Sanger FASTQ

The Sanger FASTQ format, which is used by Maq, differs slightly from the Illumina FASTQ format. The main difference is that the quality of the base calls is scored using different scales (Illumina versus Phred quality scores). Maq comes with tools to convert Illumina FASTQ (also often called Solexa FASTQ) to Sanger FASTQ; see Converting Illumina FASTQ to Sanger FASTQ and the Maq documentation for more information.
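The relation between the two scoring schemes can be worked through in a few lines of Python. This is a sketch of the arithmetic only, not Maq's conversion code. The offset of 64 is an assumption taken from the Illumina table in Appendix B (where the character A has value 1), the offset of 33 comes from the Sanger formula given later in this guide, and all function names are hypothetical.

```python
import math

ILLUMINA_OFFSET = 64  # assumption: matches the Appendix B table ('A' -> 1)
SANGER_OFFSET = 33    # Sanger symbolic ASCII: quality = code - 33


def decode_illumina(quality_line):
    """Turn an Illumina symbolic ASCII quality line into integer scores."""
    return [ord(ch) - ILLUMINA_OFFSET for ch in quality_line]


def illumina_to_error(q):
    """Invert Q = 10 x log10((1 - e)/e) to get the error probability e."""
    return 1.0 / (1.0 + 10 ** (q / 10.0))


def illumina_to_phred(q):
    """Phred score, -10 x log10(e), for the same error probability."""
    return -10.0 * math.log10(illumina_to_error(q))


def to_sanger_char(q_illumina):
    """Sanger symbolic ASCII character for an Illumina-scale score."""
    return chr(round(illumina_to_phred(q_illumina)) + SANGER_OFFSET)


if __name__ == "__main__":
    # Show how the two scales diverge at the low end and agree at the top.
    for q in (-5, 0, 10, 20, 40):
        print(q, round(illumina_to_phred(q), 2), to_sanger_char(q))
```

At an Illumina score of 0 the error probability is 0.5, which corresponds to a Phred score of about 3; this is why the two scales differ most at the low end.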

Getting Consensus, Identifying SNPs and Indels

Maq aligns your sequence reads to a reference sequence, builds a consensus, and calls single nucleotide polymorphisms (SNPs). It can also identify insertions/deletions (indels) if you have performed paired-end sequencing. This section briefly explains how to perform these actions, and what output files you will get when you call SNPs and identify indels.

Much of this information has been summarized from the Maq user's manual and the Maq reference manual, available at maq.sourceforge.net (see Figure 2). For more detailed instructions and comprehensive descriptions of the commands in Maq, see these documents; additional information is available in the FAQ section and in the Maq Wiki.

Generating Analysis Folder

You need to generate a folder in which you run the analysis. Copy the following files to this folder:
- Read files (Illumina FASTQ format).
- Reference sequence file (FASTA format).

All output files that Maq generates will be stored in this folder (unless you specifically direct Maq to another folder).

Building Consensus

The first thing you need to do is align the reads to the reference and build a consensus. This is described in this section.

NOTE: For small sequencing projects (1 lane of sequence data from a prokaryote), many of these steps can be combined as a batch using the easyrun command. See the Maq user's manual for information.

Converting Illumina FASTQ to Sanger FASTQ

As described in Quality Values, the FASTQ format used by Maq is different from the Illumina FASTQ format. To use Maq, you first need to convert the format for all read files by entering:

    maq sol2sanger s_n_r_sequence.txt s_n_r_sequence.fastq

where:
- s_n_r_sequence.txt is the Illumina read sequence file.
- s_n_r_sequence.fastq is the output file in Sanger FASTQ.

Converting Sanger FASTQ to BFQ

Next you need to convert Sanger FASTQ to binary FASTQ (bfq) for all read files by entering:

    maq fastq2bfq s_n_r_sequence.fastq s_n_r_sequence.bfq

where:
- s_n_r_sequence.fastq is the Sanger FASTQ read sequence file.
- s_n_r_sequence.bfq is the output file in binary FASTQ.
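Because both conversions have to be repeated for every read file, it can help to generate the command lines rather than type them. The sketch below only assembles the two commands shown above (maq sol2sanger and maq fastq2bfq) for each file. The function names and the dry_run switch are hypothetical, and the commands are executed only when dry_run is set to False and maq is installed.

```python
import subprocess
from pathlib import Path


def conversion_commands(sequence_file):
    """Return the two Maq conversion commands for one Pipeline read file."""
    source = Path(sequence_file)
    fastq = source.with_suffix(".fastq")  # Sanger FASTQ output
    bfq = source.with_suffix(".bfq")      # binary FASTQ output
    return [
        ["maq", "sol2sanger", str(source), str(fastq)],
        ["maq", "fastq2bfq", str(fastq), str(bfq)],
    ]


def run_conversions(sequence_files, dry_run=True):
    """Print, and optionally run, the conversions for all read files."""
    for name in sequence_files:
        for command in conversion_commands(name):
            print(" ".join(command))
            if not dry_run:
                # Stop at the first failing command.
                subprocess.run(command, check=True)


if __name__ == "__main__":
    run_conversions(["s_3_1_sequence.txt", "s_3_2_sequence.txt"])
```

Running the script as is only prints the four command lines for lane 3, so they can be reviewed before anything is executed.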

Converting Reference FASTA to BFA

Next you need to convert FASTA to binary FASTA (bfa) for the reference sequence by entering:

    maq fasta2bfa ref.fasta ref.bfa

where:
- ref.fasta is the FASTA reference sequence file.
- ref.bfa is the output reference file in binary FASTA.

Aligning Reads to Reference

For single-read sequencing, you align the reads from one file to the reference sequence by entering:

    maq map s_n_sequence.map ref.bfa s_n_sequence.bfq

For paired-end sequencing, you align the reads from two matching paired-end files to the reference sequence by entering:

    maq map s_n_sequence.map ref.bfa s_n_1_sequence.bfq s_n_2_sequence.bfq

where:
- s_n_sequence.map is the mapped alignment output file.
- ref.bfa is the reference file in binary FASTA.
- s_n_sequence.bfq is the single-read file in binary FASTQ.
- s_n_1_sequence.bfq is the paired-end first read file in binary FASTQ.
- s_n_2_sequence.bfq is the paired-end second read file in binary FASTQ.

NOTE: When you align paired-end reads, you will get a message that indicates the success of the pairing:

    (total, ispe, mapped, paired) = ( , 1, , 6142)

The number of mapped reads should be close to the number of paired reads. If the number of paired reads is very low (6142 in the example above), and you have done long distance paired-end reads, you need to specify the maximum fragment length (which should be slightly longer than the average paired-end fragment length). For example, for paired-end reads from 500 bp fragments, set a maximum fragment length of 550 bp by adding the argument -a 550, that is, enter the following:

    maq map -a 550 s_n_sequence.map ref.bfa s_n_1_sequence.bfq s_n_2_sequence.bfq

Merging Map Files

Maq works best with 1 to 3 million reads as input when aligning reads to the reference sequence. If you have a big sequencing project with multiple lanes, you should perform the alignment per lane first, and then combine the map files using mapmerge. So if you used multiple lanes to sequence the same sample, you can combine the mapped alignments now by entering:

    maq mapmerge s_123_sequence.map s_1_sequence.map s_2_sequence.map s_3_sequence.map

where:
- s_123_sequence.map is the combined mapped alignment output file for lanes 1, 2, and 3.
- s_n_sequence.map is the mapped alignment file for lane n.

Building Consensus

Now you can assemble the consensus from the (merged) map files:

    maq assemble s123.cns ref.bfa s_123_sequence.map

where:
- s123.cns is the consensus output file.
- ref.bfa is the reference file in binary FASTA.
- s_123_sequence.map is the merged mapped alignment file.

Extracting Consensus Information

Once you have built the consensus, you can extract the new consensus sequence in FASTA format, or in FASTQ format (containing Sanger quality scores).

Extracting Consensus in FASTA Format

To extract the consensus in FASTA format, enter the following:

    maq cns2ref s123.cns >s123.cns.fasta

where:
- s123.cns is the consensus file.
- s123.cns.fasta is the output consensus file in FASTA.

Extracting Consensus in FASTQ Format

To extract the consensus in Sanger FASTQ format, enter the following:

    maq cns2fq s123.cns >s123.cns.fastq

where:
- s123.cns is the consensus file.
- s123.cns.fastq is the output consensus file in FASTQ.

The files are saved in the Sanger FASTQ format, with quality scores in the Sanger symbolic ASCII format (see Quality Values for differences with the Illumina quality scheme), according to the following formula:

    Quality value = (ASCII character code) - 33

The values of the characters in the Sanger symbolic ASCII format are listed in Appendix B, section Sanger Symbolic ASCII Quality Values.

SNP Calling

Extracting SNP Calls

Once you have built the consensus, extract SNPs the following way:

    maq cns2snp s123.cns >s123.snp

where:
- s123.cns is the consensus file.
- s123.snp is the tab-delimited output SNP file.

SNP File

To view the SNP calls, open the SNP file in Excel (Figure 3).

Figure 3 SNP File Opened in Excel (column callouts: Chromosome/Reference, Position, Reference Base, Consensus Base, Consensus Quality, Read Depth, Average # Hits, Highest Mapping Quality, Quality Difference)

The columns contain the following information:

    Column  Name                     Description
    A       Chromosome / Reference   Chromosome or reference sequence.
    B       Position                 Position of SNP on the reference sequence.
    C       Reference Base           The base as present in the reference sequence.
    D       Consensus Base           The base called in the consensus of your sequencing reads.
    E       Consensus Quality        The quality of the base called in the consensus. This is the Sanger quality, which is different from the Illumina quality scores (see Difference of Illumina and Phred Scoring Scheme).
    F       Read Depth               The number of reads covering the position.
    G       Average # Hits           The average number of hits of reads covering this position, which roughly equals the copy number of the flanking region in the reference genome.

    Column  Name                     Description
    H       Highest Mapping Quality  The highest mapping quality of the reads covering the position.
    I       Quality Difference       The quality difference between the strong allele and the weak allele. If the quality difference is close to the highest mapping quality, you may be looking at a read error.

For the consensus bases, heterozygotes are designated using IUB codes:

    IUB code  Bases
    A         A
    C         C
    G         G
    T         T
    M         A/C
    K         G/T
    Y         C/T
    R         A/G
    W         A/T
    S         G/C
    D         A/G/T
    B         C/G/T
    H         A/C/T
    V         A/C/G
    N         A/C/G/T

Improving SNP Quality

In addition, the following commands are useful for filtering SNP calls:

SNPfilter. SNPfilter removes SNPs that are covered by just one read, fall in a repetitive region, or fall in a 10 bp region with at least 3 SNPs. Enter the following:

    perl maq.pl SNPfilter s123.snp >s123.filtered.snp

where:
- s123.snp is the input SNP file.

- s123.filtered.snp is the tab-delimited, filtered SNP output file.

rmdup. Rmdup removes pairs with identical ends, which could have been caused by PCR during sample preparation. Removing duplicates may improve SNP calling accuracy. This filter must be applied before the consensus is assembled (see Building Consensus); use it as follows:

    maq rmdup s_123_rmdup.map s_123_sequence.map

where:
- s_123_rmdup.map is the output filtered mapped alignment file.
- s_123_sequence.map is the input mapped alignment file.

Indel Discovery

Extracting Indels

Once you have built the consensus, you can extract the indels the following way:

    maq indelpe ref.bfa s_123_sequence.map >s_123_sequence.indelpe

where:
- ref.bfa is the reference file in binary FASTA.
- s_123_sequence.map is the merged mapped alignment file.
- s_123_sequence.indelpe is the tab-delimited output indel file.

NOTE: You can only find indels using Maq with paired-end data.

Indel File

To view the indels found, open the indel file in Excel (Figure 4).

Figure 4 Indel File Opened in Excel (column callouts: Chromosome/Reference, Position, Indel Type, # Ref Reads, Indel Size, Forward Reads, Reverse Reads)

The columns contain the following information:

    Column  Name                     Description
    A       Chromosome / Reference   Chromosome or reference sequence.
    B       Start Position           Start position of indel on reference sequence.
    C       Indel Type               * indicates the indel is confirmed by reads from both strands.
                                     + means the indel is hit by at least two reads but from the same strand.
                                     - shows the indel is only found on one read.
                                     . means the indel is too close to another indel and is filtered out.
    D       # Ref Reads              The number of reads across the indel.
    E       Indel Size               Size of indel.
    F       Forward Reads            Number of reads on the forward strand confirming the consensus.
    G       Reverse Reads            Number of reads on the reverse strand confirming the consensus.

NOTE: If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field.
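The filtering suggested in the note can also be done without Excel. The following is a minimal sketch that assumes the indel file is tab-delimited with exactly the seven columns A to G in the order listed above; output from other Maq versions may be laid out differently. The Indel record and both function names are hypothetical.

```python
import csv
from typing import NamedTuple


class Indel(NamedTuple):
    reference: str      # column A
    start: int          # column B
    indel_type: str     # column C: *, +, - or .
    ref_reads: int      # column D
    size: int           # column E
    forward_reads: int  # column F
    reverse_reads: int  # column G


def read_indels(path):
    """Read a tab-delimited indel file with the seven columns listed above."""
    indels = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 7:
                continue  # skip blank or incomplete lines
            indels.append(Indel(row[0], int(row[1]), row[2], int(row[3]),
                                int(row[4]), int(row[5]), int(row[6])))
    return indels


def confirmed_on_both_strands(indels):
    """Keep only indels marked '*' (confirmed by reads from both strands)."""
    return [indel for indel in indels if indel.indel_type == "*"]
```

For example, confirmed_on_both_strands(read_indels("s_123_sequence.indelpe")) returns the subset that the note recommends concentrating on.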

Viewing SNPs and Indels with GBrowse

Once you have files with SNPs and indels, you may want to view them in a genomic context. Many genome centers have implemented GBrowse, an open source genome viewer. This section helps you view your results in a GBrowse viewer. You will need to perform the following steps:

1. Find a GBrowse implementation for the organism and build you are interested in.
2. Convert your SNP or indel data to the proper file format.
3. Upload the file to GBrowse.

Now you are ready to look at your SNPs and indels as annotations in a genomic context.

GBrowse

GBrowse is an open source genome viewer, generated as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.

Finding Suitable GBrowse Implementation

Lists of implementations can be found at the following two websites:

Browse through these lists and see if there is a GBrowse implementation for the organism and build you are interested in. These lists are not comprehensive; if you can't find one you can use, try searching Google for GBrowse and your particular build, and see if you can find an appropriate implementation that way.

Alternative Solutions

If no suitable implementation of GBrowse exists, you can do two things:
- Redo your alignments with a build that is supported in a GBrowse implementation.
- Install GBrowse locally. This is possible, but requires more work and skill. See for instructions.

Reformatting Data

The SNP and indel files do not have the appropriate format for GBrowse to recognize. Fortunately, they are usually not extremely big and can be handled in Excel, so you do not need a Perl script to change the format. This section explains how to reformat your SNP or indel data.

Annotation File Format

GBrowse can read a number of different file formats. Here we explain the annotation file format that works well with our data (Figure 5).

Figure 5 GBrowse File

The annotation file is a text file, and has to start with the following line:

    reference=landmark name

The reference line has the following properties:
- The line starts with reference= (in lowercase).
- The line refers to the chromosome (reference=chr1) or the accession number of the organism (reference=nc_000913). No spaces allowed.
- The reference applies to all entries below it, until a new reference is found. Multiple reference lines are allowed.

The reference line is followed by data lines, which have the following fields:

    Column  Entry                   Description
    A       Feature Type            In our case SNP or INDEL.
    B       Feature Name            A unique name for each entry.
    C       Feature Position        One or more ranges in the format , or ,
    D       Description (optional)  A description that will be displayed in the viewer.
    E       URL (optional)          If you have a hyperlink, provide it here.

NOTE: Do not use spaces, unless you put quotation marks around the field entry.
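As an alternative to spreadsheet editing, the annotation lines can be generated with a short script. The sketch below assumes the SNP file columns described under SNP File (reference, position, reference base, consensus base, consensus quality in the first five columns) and builds the same name, position range and description that the Excel formulas in Reformatting SNP Files produce. The function name and the example rows in the usage note are hypothetical.

```python
def snp_rows_to_annotation(rows):
    """Build annotation lines from SNP file rows.

    Each row is a list of fields from the tab-delimited SNP file. A
    reference= line is emitted whenever the chromosome/reference changes,
    followed by tab-separated data lines (type, name, position, description).
    """
    lines = []
    current_reference = None
    for number, row in enumerate(rows, start=1):
        reference, position, ref_base, cons_base, quality = row[:5]
        if reference != current_reference:
            lines.append(f"reference={reference}")
            current_reference = reference
        # Same pattern as the Excel formulas: ref>cons,Qquality,position
        description = f"{ref_base}>{cons_base},Q{quality},{position}"
        position_range = f"{position}-{position}"
        lines.append("\t".join(["SNP", f"SNP{number}", position_range, description]))
    return lines
```

Reading the SNP file with line.rstrip("\n").split("\t"), passing the rows to this function, and writing the returned lines to a .txt file gives a file in the layout described above. The description contains no spaces, so no quotation marks are needed.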

Reformatting SNP Files

To reformat the SNP file, perform the following steps:

1. Open the SNP file in Excel.
2. To get a unique SNP name, enter SNP1 in the top field of the empty column J.
3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column K, enter:

    =CONCATENATE(B1,"-",B1)

4. You need one field with an informative description for every SNP. In the top field of the empty column L, enter:

    =CONCATENATE(C1,">",D1,",Q",E1,",",B1)

   The SNP description will consist of the following information: reference base>consensus base,quality score,position
5. To copy all formulas and calculate values for every entry:
   a. Select fields J1, K1, and L1.
   b. Drag down the selected fields by the bottom right corner (Figure 6).

Figure 6 Drag Down Bottom Right Corner (callouts: Select Bottom Right Corner, Drag Down to Last Entry)

   The values in columns K and L should automatically recalculate, and column J should be filled with unique names (SNP1, SNP2, and so on).
6. Save the file in Excel format (*.xls).
7. Open a new workbook. This will be the annotation file.
8. Copy the values from columns J, K and L of the modified SNP file to columns B, C and D of the annotation file (paste values only).
9. Enter SNP in the top field of the empty column A of the annotation file. Copy SNP all the way down to the last data line.
10. Select the first row and insert an empty line by pressing Ctrl+Shift++ (plus sign).
11. Enter the reference line in field A1, for example reference=chr1 or reference=nc_

NOTE: You can refer to multiple chromosomes per file; just insert a reference line with the new chromosome above the data line where the next chromosome starts. The reference applies to all entries below it, until a new reference is found.

The SNP annotation file should look like this (Figure 7):

Figure 7 SNP Annotation File

12. Save the SNP annotation file as a text (tab delimited) file (*.txt).

Reformatting Indel Files

To reformat the indel file, perform the following steps:

1. Open the indel file in Excel.

NOTE: If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field (column C), and copy all the promising indels to a new workbook.

2. To get a unique indel name, enter INDEL1 in the top field of the empty column H.
3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column I, enter:

    =CONCATENATE(B1,"-",B1)

4. You need one field with an informative description for every indel. In the top field of the empty column J, enter:

    =CONCATENATE(C1,",",E1,",f",F1,",r",G1)

   The indel description will consist of the following information: indel type,indel size,f forward reads,r reverse reads
5. To copy all formulas and calculate values for every entry:
   a. Select fields H1, I1, and J1.
   b. Drag down the selected fields by the bottom right corner (Figure 6).
   The values in columns I and J should automatically recalculate, and column H should be filled with unique names (INDEL1, INDEL2, and so on).

6. Save the file in Excel format (*.xls).
7. Open a new workbook. This will be the annotation file.
8. Copy the values from columns H, I, and J of the modified indel file to columns B, C and D of the annotation file (paste values only).
9. Enter INDEL in the top field of the empty column A of the annotation file. Copy INDEL all the way down to the last data line.
10. Select the first row and insert an empty line by pressing Ctrl+Shift++ (plus sign).
11. Enter the reference line in field A1, for example reference=chr1 or reference=nc_

NOTE: You can refer to multiple chromosomes per file; just insert a reference line with the new chromosome above the data line where the next chromosome starts. The reference applies to all entries below it, until a new reference is found.

The indel annotation file should look like this (Figure 8):

Figure 8 Indel Annotation File

12. Save the indel annotation file as a text (tab delimited) file (*.txt).

Using GBrowse

When you have generated your annotation file and found a suitable GBrowse implementation, you can start viewing your indels or SNPs in a genomic context. For comprehensive GBrowse help, FAQs and a tutorial, see

Upload the Annotation File

1. Navigate your web browser to the website running GBrowse.
2. Scroll down to the bottom of the page, where you can upload your own annotations (Figure 9). Different GBrowse implementations may look slightly different.

Figure 9 Upload Annotation File (callouts: Browse to File, Upload File)

3. Click Browse, go to the annotation file, select the file, and click Open.
4. Click Upload.

Viewing SNPs and Indels

Once your annotation file is uploaded, you will see the file appear with the separate features (Figure 10).

Figure 10 Uploaded Annotation File (callouts: Annotation Check Box, Uploaded Annotation File, Edit File, Clickable Features)

Make sure the annotation check box is selected. You can now edit the uploaded annotation file, or click on the separate features (SNPs or indels). This will display the feature in the viewer panel (Figure 11 and Figure 12).

Figure 11 Your Favorite SNP in the GBrowse Viewer (callouts: Zoom and Browse Area, Gene Information, Published SNPs, Your Favorite SNP)

Figure 12 Your Favorite Indels in the GBrowse Viewer (callouts: Zoom and Browse Area, Gene Information, Your Favorite Indels)

Appendix A: Installing Maq Yourself

If you decide to install Maq yourself, do the following:

1. Open your browser in Linux and navigate to maq.sourceforge.net.
2. Click on the link download page (see Figure 2).
3. Click on the link Download for the most recent version of Maq.
4. Click on the package for your Linux and hardware configuration. If you are not sure which one is best, choose platform independent.
5. Click Save to download the package.
6. Repeat steps 3 to 5 for Maqview and Maq-Data.
7. Open the command line (Terminal).
8. Go to the directory containing the downloaded files using the cd command. The exact location depends on how your Linux is set up.
9. To unzip the packages, type the following in the command line:

    bunzip2 *.bz2

10. List the directory contents by using the ls command.
11. To extract the files from the archive, type the following for every *.tar file in the directory:

    tar xvf name.tar

    You should get three new directories (check by using the ls command).
12. Go to the directory containing the Maq files:

    cd maq-x.x.x

13. Install the package by entering the following three commands in succession:

    ./configure
    make
    make install

14. If you get a message that access is denied to the default install directory, you need to specify a directory that you do have access to. Enter the following two commands (with /home/share/yourfolder being your accessible directory):

    ./configure --prefix=/home/share/yourfolder
    make install

15. Go one directory up:

    cd ..

16. Test whether Maq is working by entering:

    maq

    You should get a message explaining Maq usage. If the command maq is not recognized, try the second method described in the Maq User Manual, or ask a Linux expert for help.

Appendix B: Quality Value Tables

Illumina Symbolic ASCII Quality Values

The quality values of the characters in the Illumina symbolic ASCII format are listed in the table below:

Table 1 Quality Value of Characters in the Illumina Symbolic ASCII Format

    Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.
    Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value
    ;     -5      C     3       K     11      S     19      [     27      c     35
    <     -4      D     4       L     12      T     20      \     28      d     36
    =     -3      E     5       M     13      U     21      ]     29      e     37
    >     -2      F     6       N     14      V     22      ^     30      f     38
    ?     -1      G     7       O     15      W     23      _     31      g     39
    @     0       H     8       P     16      X     24      `     32      h     40
    A     1       I     9       Q     17      Y     25      a     33
    B     2       J     10      R     18      Z     26      b     34

Sanger Symbolic ASCII Quality Values

The quality values of the characters in the Sanger symbolic ASCII format are listed in the table below:

Table 2 Quality Value of Characters in the Sanger Symbolic ASCII Format

    Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.
    Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value
    !     0       /     14      =     28      K     42      Y     56      g     70      u     84
    "     1       0     15      >     29      L     43      Z     57      h     71      v     85
    #     2       1     16      ?     30      M     44      [     58      i     72      w     86
    $     3       2     17      @     31      N     45      \     59      j     73      x     87
    %     4       3     18      A     32      O     46      ]     60      k     74      y     88
    &     5       4     19      B     33      P     47      ^     61      l     75      z     89
    '     6       5     20      C     34      Q     48      _     62      m     76      {     90
    (     7       6     21      D     35      R     49      `     63      n     77      |     91
    )     8       7     22      E     36      S     50      a     64      o     78      }     92
    *     9       8     23      F     37      T     51      b     65      p     79      ~     93
    +     10      9     24      G     38      U     52      c     66      q     80
    ,     11      :     25      H     39      V     53      d     67      r     81
    -     12      ;     26      I     40      W     54      e     68      s     82
    .     13      <     27      J     41      X     55      f     69      t     83
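Both tables in this appendix follow a simple rule, so they can be regenerated rather than looked up. The sketch below assumes the offsets implied by the tables themselves: the Illumina values are consistent with subtracting 64 from the ASCII code (';' is code 59 and has value -5), and the Sanger values with subtracting 33 ('!' is code 33 and has value 0). The function names are hypothetical.

```python
def quality_table(first_char, last_char, offset):
    """Map each character in the given ASCII range to code - offset."""
    return {chr(code): code - offset
            for code in range(ord(first_char), ord(last_char) + 1)}


# Character ranges as printed in Table 1 and Table 2.
ILLUMINA_TABLE = quality_table(";", "h", 64)
SANGER_TABLE = quality_table("!", "~", 33)


def format_table(table, rows):
    """Lay the table out column by column, with `rows` entries per column."""
    items = list(table.items())
    lines = []
    for r in range(rows):
        # Every `rows`-th item starting at r ends up on the same printed row.
        cells = [f"{ch} {q:>3}" for ch, q in items[r::rows]]
        lines.append("   ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(ILLUMINA_TABLE, 8))
    print()
    print(format_table(SANGER_TABLE, 14))
```

Printing with 8 rows for the Illumina table and 14 rows for the Sanger table reproduces the column arrangement used in Table 1 and Table 2.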

Illumina, Inc.
Towne Centre Drive
San Diego, CA
ILMN (4566)
(outside North America)
