Omixon PreciseAlign CLC Genomics Workbench plug-in

User manual for
Omixon PreciseAlign plug-in
CLC Genomics Workbench plug-in (all platforms)
CLC Genomics Server plug-in (all platforms)

January 12, 2012

Omixon Biocomputing Kft
Petzval József utca 56., Budapest, 1119 Hungary
www.omixon.com - info@omixon.com

Contents

INTRODUCTION TO THE PLUG-IN
STARTING THE PLUGIN
MAPPING LETTER SPACE READS
    SETTING THE BASIC ALIGNMENT PARAMETERS
    SETTING THE SPECIES PARAMETERS
        Species Profile
        Speed Profile
MAPPING COLOR SPACE READS
    SETTING THE ALIGNMENT PARAMETERS
        Alignment sensitivity
        Divergence
    SETTING THE SHORT READ PARAMETERS
        Handling non-specific reads
        Mapping paired reads
    SETTING THE QUALITY PARAMETERS
ANALYZING THE RESULTS
INSTALLATION
    WORKBENCH PLUG-IN INSTALLATION
    SERVER PLUG-IN INSTALLATION
    SYSTEM REQUIREMENTS
UNINSTALL
    WORKBENCH UNINSTALL
    SERVER UNINSTALL

Introduction to the plug-in

The Omixon PreciseAlign plug-in is intended for the analysis of letter space (also called base space) and color space data produced by next generation sequencing (NGS) instruments. This plug-in is designed to work with the other sequence analysis tools and plug-ins provided by the CLC Genomics Workbench and Server. The modules within the Omixon PreciseAlign plug-in are based on the Omixon Color and Letter Space Toolkits.

Illumina, Ion Torrent and 454 data

For the alignment of letter space reads, the letter space module of the Omixon PreciseAlign plug-in, called ORM (Omixon Read Mapper), follows the seed-and-extend paradigm. Letter space (base space) reads are indexed by ORM using spaced seeds and approximately mapped to a reference sequence database. ORM uses a second, much smaller seed to help filter the approximate mappings. The underlying data structures are extremely economical in memory use, yet still provide high flexibility for trade-offs between sensitivity and specificity.

The fine alignment uses a combination of information and algorithms to produce its results, including the quality scores from the sequencer and a DNA mutation model. There are two main alignment techniques: a 'bridging' technique for smaller reads (such as Illumina short reads), where only one indel is expected, and a 'lacing' technique, which allows for more indels per read to cater for the longer Ion Torrent and Roche 454 reads. Repeats, in particular tandem repeats, receive special handling in the mapping. The tool also provides homopolymer error correction for Ion Torrent and Roche 454 reads.

SOLiD data

For the alignment of color space reads, the Omixon PreciseAlign plug-in uses different algorithms, which were designed specifically for the Life Technologies SOLiD sequencer and its data error model. The mapping uses a spaced seed technique with greedy extension to find the most likely location of the short reads on the reference.
In such an alignment, reads are mapped to approximate genome positions, allowing for a pre-specified bound on sequence divergence that combines nucleotide mismatches, gaps, and sequencing errors. The precise alignment relies on a pair hidden Markov model (pair-HMM) framework, combining DNA sequence evolution models and sequencing errors (from read quality values). Variants can be reported with statistical confidence measures that take into account alignment accuracy and sequence quality scores.

The first step of the alignment is called Crema (Color REad MApper). It uses a single spaced seed to identify candidate mapping locations for each read. The spaced seed to be used is one of the parameters (called 'sensitivity').

The second step of the alignment is called AMAP. It performs a fine-grained statistical alignment of the reads mapped by the Crema step. The mapped locations in the input are aligned statistically using a combination of pair-HMM and banded alignment algorithms that take into account the quality data. This allows for very accurate SNP and indel identification, along with identification and filtering of sequencing errors.

A more detailed description of the algorithm can be found in: Csuros, M., S. Juhos, and A. Berces. 2010. Fast mapping and precise alignment of AB SOLiD color reads to reference DNA, p. 176-188. In V. Moulton and M. Singh (ed.), Proceedings of the 10th International Conference on Algorithms in Bioinformatics. Springer-Verlag, Berlin, Germany.

Output

The output from the Omixon PreciseAlign plug-in is a standard CLC bio read mapping object, which can be analysed using the CLC bio SNP Detection and DIP Detection tools (generating reports and/or reference annotations), or exported as a SAM file for further analysis.

You can learn more about the Omixon Letter Space here and the Omixon Color Space here.
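As an illustration of the spaced seed idea mentioned in the introduction, the following Python sketch finds candidate mapping locations for a read. It is a toy example and not the ORM or Crema implementation: the seed pattern, the function names and the choice to index the reference (rather than the reads) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy seed: '1' positions must match, '0' positions are ignored.
SEED = "1101"


def seed_key(sequence, start, seed):
    """Return the characters of `sequence` at the '1' positions of `seed`."""
    return "".join(sequence[start + i] for i, bit in enumerate(seed) if bit == "1")


def build_index(reference, seed):
    """Map each spaced-seed key to the reference positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(seed) + 1):
        index[seed_key(reference, pos, seed)].append(pos)
    return index


def candidate_locations(read, index, seed):
    """Return sorted candidate reference start positions for `read`."""
    candidates = set()
    for offset in range(len(read) - len(seed) + 1):
        key = seed_key(read, offset, seed)
        for ref_pos in index.get(key, []):
            start = ref_pos - offset  # implied start of the whole read
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


# Made-up data: the read is reference[4:12] with one mismatch at read position 2.
reference = "GATTACAGGCTTAACG"
read = "ACTGGCTT"
index = build_index(reference, SEED)
print(candidate_locations(read, index, SEED))
```

Because the mismatch falls on a '0' position of the seed at read offset 0, the true location (reference position 4) is still found as a candidate. A real mapper would then pass each candidate to the extension or fine alignment step.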

The basic work flow supported by the CLC Genomics Workbench or Server and the Omixon PreciseAlign plug-in is this:

1. Import reference genome (an example E. coli data set can be downloaded from clcbio.com).
2. Import sequencing data (using the High-throughput Sequencing Data import). Make sure that the box before 'Discard read names' is not checked!
3. Perform read mapping using the Omixon PreciseAlign plug-in.
4. Run SNP and DIP detection to detect variants in the sequencing data.

Check the User Manual of the CLC Genomics Workbench for more info on steps 1, 2 and 4. This manual provides more info on step 3.

Starting the plugin

To run the Omixon PreciseAlign plug-in:

Go to Toolkit | Omixon Tools
Choose 454, Illumina, IonTorrent or SOLiD map and align
Select either Workbench or Server and click Next.

This opens the dialog shown in figure 1.

Figure 1 Read file selection

Mapping letter space reads

Select one read sequence to be mapped and click Next. This opens the dialog shown in figure 2.

Figure 2 Basic parameter settings

(If you want to align more than one short read file in the same run, choose the 'Batch' function and select the CLC folder containing the read files.)

The reference genome to align against can be selected by clicking on the Browse icon.

Setting the basic alignment parameters

'Max alignments reported' - How many alignments to report in total for each read or pair. ORM tracks the best alignments and can output more than one if several are found. Reads or pairs with more than one alignment will be flagged as 'non-specific' reads (yellow color by default in the CLC view).

'Min alignment score' - The alignment step has an in-built quality filter. Reads whose scores are below this value after alignment (i.e. reads with a very low quality alignment) will automatically be discarded.

'Max indel' - The largest indel that will be allowed in an alignment.

There is an additional performance option included in the plug-ins: 'Use All Available Processors'. By default the plug-ins will use all processors. If this option is deselected, 'all but one' processor will be used. A note of caution for Workbench users: if you leave this option set to the default you won't be able to do much else in the Workbench until the plug-in has finished.

Choose the Reference sequence and values for the Basic parameters and Performance and click Next. This opens the dialog shown in figure 3.

Figure 3 Mapping parameters and speed settings

Setting the species parameters

Species Profile

The plug-in includes some built-in 'profiles', which can be used to easily run the mapper. These profiles are:

'Bacterial 2.5': for mapping bacterial data at 2.5% divergence (appropriate for E. coli strains),
'Human': for mapping human data at 0.1% divergence,
'Human+': for mapping human data with higher sensitivity and lower speed,
'Other': a highly customizable profile. For instructions on how to set the parameters see below.

If you don't find the species you are working with in the species list, you can select 'Other' and set your own species parameters. The most important parameter for the species is the estimated divergence between the sample and the reference; this must be set correctly for good results. The divergence is entered as a fraction: 0.1% (human divergence) is expressed as 0.001, and a bacterial divergence of 2.5% would be 0.025. A value that is much too high or much too low can seriously affect the quality of the alignment results.

Advanced parameters

Some of the advanced parameters will also become available for you to change. These include the 'big seed' and 'small seed' and the various alignment penalties. The program uses two seeds, a big seed to find candidate mapping locations and a small seed to help filter the locations. These are both gapped seeds consisting of zeroes and ones. The alignment penalties are given on a Phred scale, and correspond to a frequency of 1 in 10,000 for insertions/deletions and 1 in 1,000 for repeats.

Speed Profile

ORM is both accurate (sensitive and specific) and very fast. There is also a parameter that tells the aligner how 'tenacious' (normal or hasty) to be in trying to find good mapping locations. Higher tenacity leads to slower run times but better results.
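The divergence and penalty values above involve two small conversions, sketched below in Python. The sketch assumes the standard Phred definition Q = -10 * log10(p); the manual does not state the exact formula used by the plug-in, and the function names are hypothetical.

```python
import math


def percent_to_fraction(percent):
    """Convert a divergence given in percent to the fraction the dialog expects."""
    return percent / 100.0


def phred_from_frequency(frequency):
    """Convert an event frequency to a Phred-scaled value (assumed: Q = -10 * log10(p))."""
    return -10.0 * math.log10(frequency)


# Divergence values quoted in this manual.
print(percent_to_fraction(0.1))   # human divergence, about 0.001
print(percent_to_fraction(2.5))   # bacterial divergence, 0.025

# Penalty frequencies quoted in this manual, converted under the assumption above.
indel_penalty = phred_from_frequency(1 / 10_000)
repeat_penalty = phred_from_frequency(1 / 1_000)
print(round(indel_penalty), round(repeat_penalty))
```

Under this assumption a frequency of 1 in 10,000 corresponds to a Phred value of 40 and a frequency of 1 in 1,000 to a Phred value of 30.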

Set the Species profile and Speed profile and click Next.

The final step is the standard CLC dialog for results handling. Choose whether to Save or Open the results, and whether or not to Make Logs. To start the Omixon PreciseAlign click Finish.

This plug-in will generate one kind of output, a CLC-style mapped reads object:

Figure 4 Results

Mapping color space reads

Select one (or more) short read sequences to be mapped and click Next. This opens the dialog shown in figure 5. The reference genome to align against can be selected by clicking on the Browse icon.

Setting the alignment parameters

Alignment sensitivity

There are three sensitivity settings (i.e. three seeds) provided within this plug-in, called Fast, Sensitive and Ultra-sensitive. The seed used affects the sensitivity of the alignment, and also dramatically impacts the run time of the Crema step. The shorter the seed, the more sensitive the mapping and the longer the run time. For many analyses (including those using human data) the longer, less sensitive seeds will suffice.

Divergence

The main parameter for AMAP is the expected divergence. If you know approximately how divergent your sample is from your reference you can set this parameter, which can improve the results. The default value is 0.001 (0.1%), which is sufficient for human data. Please note that this is the most important parameter: a value that is much too high or much too low can seriously affect the quality of the alignment results.

Figure 5 Parameter settings

Setting the short read parameters

In addition to the sensitivity setting of Crema you can also specify the maximum number of indels and SNPs ('mismatches') allowed for a single read to be matched.

Handling non-specific reads

For non-specific reads (reads that map and align to multiple locations in the reference) there is some extra assistance. You can choose what kind of - or how many - mapping locations you wish to keep, and additionally how many of these locations (or how many paired locations) to report in total after the alignment step (Max Locations Reported). It's usually recommended to map a few more locations than will be finally reported, as the best mapping location does not always result in the best alignment.

The plug-in supports three strategies for mapping non-specific reads. These three strategies don't apply to paired reads. The strategies are:

Random. One mapping location is chosen at random from the (up to) 5 best mapping locations found. This is useful for filling gaps in coverage due to large repeats.

Ignore. A read has no mapping reported at all if it maps to more than one location.

Max. This one has the extra 'Max Locations Mapped' parameter to specify how many locations you would like to map the read to during the first Crema mapping step. The best mapping locations will be reported, up to the Max Locations Mapped value. These mapping locations will all be fed into AMAP, which will give each a separate set of alignment scores, which are then filtered using the Max Locations Reported value.

Please note that using the Ignore strategy can lead to a large proportion of reads not being mapped at all, depending on the characteristics of the data set. With human data it's recommended to use either the Random or Max strategies.

Mapping paired reads

When mapping paired reads (either mate pair or paired end) the non-specific reads strategy is basically ignored. Instead, the proximity of the pairs is used to choose which location to map. If more than one 'closest' pair mapping is found then all (up to 3) will be returned in the output, to allow mapping of mates to long repeated regions, if required.
Alternatively, these can be narrowed down further by using the 'Max Locations Reported' value, to give the single best alignment of the pair. If only one of the pair is mapped and this 'orphan' read has multiple mapping locations then the Random non-specific reads strategy is used for that single read.

Choose the reference sequence and values for the alignment parameters and click Next. This opens the dialog shown in figure 6.

Setting the quality parameters

The AMAP step has two in-built quality filters (unless the input is paired reads). Reads whose scores are below these values after alignment (i.e. reads with a very low quality alignment) will automatically be discarded. This can save some manual filtering of the results.

The two filter values correspond to two standard SAM tags, AS and UQ. AS is the 'alignment score' that is generated at the end of the AMAP alignment (min 0, max 100, default 60). UQ is the 'Phred likelihood of the mapping being correct' and is calculated early on during the AMAP run (min 33, max 126, default 100). Reads with either value less than these thresholds will be discarded by AMAP. Note that these quality filters are ignored when using paired reads for the input.

Figure 7 Quality and performance settings

The final step is the standard CLC dialog for results handling. Choose whether to Save or Open the results, and whether or not to Make Logs. To start the Omixon PreciseAlign click Finish.
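The non-specific read strategies and the AMAP quality filters described in this section can be summarised in a short Python sketch. This is only an illustration of the rules as stated in this manual, not the plug-in's code: the function names, the (position, score) representation of a mapping location and the assumption that a higher mapping score is better are all hypothetical.

```python
import random


def select_locations(locations, strategy, max_locations_mapped=1, rng=None):
    """Choose mapping locations for a single (unpaired) read.

    `locations` is a list of (position, mapping_score) tuples; a higher
    score is assumed to be better.
    """
    if len(locations) <= 1:
        return list(locations)  # the read is specific, nothing to decide
    ranked = sorted(locations, key=lambda loc: loc[1], reverse=True)
    if strategy == "ignore":
        return []  # no mapping reported for non-specific reads
    if strategy == "random":
        rng = rng or random.Random()
        return [rng.choice(ranked[:5])]  # one of the (up to) 5 best
    if strategy == "max":
        return ranked[:max_locations_mapped]
    raise ValueError(f"unknown strategy: {strategy}")


def passes_quality_filters(as_score, uq_score, min_as=60, min_uq=100, paired=False):
    """Apply the AS and UQ thresholds; the filters are skipped for paired reads."""
    if paired:
        return True
    return as_score >= min_as and uq_score >= min_uq


# Made-up mapping locations for one non-specific read.
locations = [(100, 50), (200, 70), (300, 60)]
print(select_locations(locations, "max", max_locations_mapped=2))
print(passes_quality_filters(as_score=59, uq_score=120))
```

The default thresholds in `passes_quality_filters` are the AS and UQ defaults quoted above (60 and 100). A read fails as soon as either value is below its threshold.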

Analyzing the results

Note that the plug-in uses the facilities within the CLC bio High-Throughput Sequencing plug-ins, which come built-in to the CLC Genomics Workbench. If for any reason the plug-in is unable to convert the results correctly using these plug-ins then it will display the results as a text file (in SAM format) instead.

The SNP Detection and DIP Detection tools within the CLC Genomics Workbench can be used to display the variants found by Omixon PreciseAlign. Check the User Manual of the CLC Genomics Workbench for more info on these steps.

SNP Detection: Toolkit | High-Throughput Sequencing | SNP Detection
DIP Detection (indels): Toolkit | High-Throughput Sequencing | DIP Detection

As usual, the mapped reads object can also be exported from the Workbench in SAM format:

File | Export or Export in the Toolbar
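Once the mapping has been exported in SAM format, the AS and UQ tags can be inspected outside the Workbench. The following Python sketch reads the optional fields of SAM alignment lines (columns 12 onwards, in TAG:TYPE:VALUE form) and lists reads that fall below given thresholds. The sample lines and function names are made up for the example, and it is an assumption that an exported file carries both tags on every record.

```python
def parse_sam_tags(line):
    """Return (read_name, tags) for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields follow the 11 mandatory columns
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return fields[0], tags


def low_quality_reads(lines, min_as=60, min_uq=100):
    """Return names of reads whose AS or UQ tag is below the given threshold."""
    names = []
    for line in lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        name, tags = parse_sam_tags(line)
        as_score = tags.get("AS")
        uq_score = tags.get("UQ")
        if (as_score is not None and as_score < min_as) or (
            uq_score is not None and uq_score < min_uq
        ):
            names.append(name)
    return names


# Hypothetical SAM records, for illustration only.
sample_lines = [
    "@SQ\tSN:ref\tLN:1000",
    "read1\t0\tref\t100\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tAS:i:85\tUQ:i:110",
    "read2\t0\tref\t200\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tAS:i:40\tUQ:i:105",
]
print(low_quality_reads(sample_lines))
```

For a real file, the lines can be read with `open(path)` and passed to `low_quality_reads` directly, since the function only iterates over them.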

Installation

Workbench plug-in installation

The Omixon PreciseAlign plug-in is installed as a Workbench plug-in. Note that the Workbench plug-in is used as the client for the Server plug-in. Plug-ins are installed using the plug-in manager:

Help in the Menu Bar | Plug-ins and Resources... or Plug-ins in the Toolbar

The plug-in manager has four tabs at the top:

Manage Plug-ins. This is an overview of plug-ins that are installed.
Download Plug-ins. This is an overview of available plug-ins on CLC bio's server.
Manage Resources. This is an overview of resources that are installed.
Download Resources. This is an overview of available resources on CLC bio's server.

To install a plug-in, click the Download Plug-ins tab. This will display an overview of the plug-ins that are available for download and installation (see figure 6).

Figure 6 The plug-ins that are available for download.

Clicking a plug-in will display additional information at the right side of the dialog. This will also display a button: Download and Install. Click the Omixon PreciseAlign plug-in and press Download and Install. A dialog displaying progress is now shown, and the plug-in is downloaded and installed.

If the Omixon PreciseAlign plug-in is not shown on the server, and you have it on your computer (e.g. if you have downloaded it from the CLC bio web site), you can install it by clicking the Install from File button at the bottom of the dialog. This will open a dialog where you can browse for the plug-in. The plug-in file should be a file of the type ".cpa".

When you close the dialog, you will be asked whether you wish to restart the CLC Workbench. The plug-in will not be ready for use before you have restarted.

Server plug-in installation

First, the server plug-in should be downloaded from the CLC bio web site. Then, after logging in to the server administration web site, the plug-in can be installed as follows:

Admin | Plugins | Install new plug-in

Click on the Browse button. This will open a dialog where you can browse for the plug-in. The plug-in file should be a file of the type ".cpa". When you close the dialog, you can then click the Install Plug-in button.

System requirements

These plug-ins need CLC bio Genomics Workbench version 5.0 or above. For analyzing 230 MB of short reads, at least 2 GB of memory is required.

Uninstall

Workbench uninstall

Plug-ins are uninstalled using the plug-in manager:

Help in the Menu Bar | Plug-ins and Resources... or Plug-ins in the Toolbar

This will open the dialog shown in figure 7.

Figure 7 The plug-in manager with plug-ins installed.

The installed plug-ins are shown in this dialog. To uninstall:

Click the Omixon PreciseAlign plug-in | Uninstall

If you do not wish to completely uninstall the plug-in but you don't want it to be used next time you start the Workbench, click the Disable button.

When you close the dialog, you will be asked whether you wish to restart the Workbench. The plug-in will not be uninstalled until the Workbench is restarted.

Server uninstall

After logging in to the server administration web site, the plug-in can be uninstalled as follows:

Admin | Plugins | Installed plug-ins

Find the Omixon plug-in in the list, and then click on the Uninstall Omixon PreciseAlign Server plug-in button. This will open a dialog where you can confirm.