User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux



User's Guide to DNASTAR SeqMan NGen 12.0
For Windows, Macintosh and Linux
DNASTAR, Inc. 2014

Contents

SeqMan NGen Overview
Wizard Navigation
Non-English Keyboards
Before You Begin
The Welcome Screen
Choose Project Type
Choose Assembly Type
Reference-Guided Assembly with Gap Closure
Metagenomics/16S rRNA Workflows
Viral-Host Integration Workflows
BAM Import
Recalculate SNPs
Set Up Project Files
Set Up Project Files (De Novo, Special Templated)
Set Up Project Files (All Others)
Input Template/Host Files
Downloading and Extracting a Genome Package
Annotating Template Sequence Prior to Assembly
Input Viral/Biome Genomes
Input Sequence Files
Edit Group Names
Edit MID Tags
Set Pair Information (Certain Sanger Data)
Set Pair Information (All Others)
Using Paired End Data
Illumina Pairs
Roche 454 Pairs
Sanger Pairs
Example Regular Expressions
Set Up Experiments
Input BAM Layout File
Read Options
Files and Folders Dialogs
Advanced Trim/Scan Options
Assembly Options
Assembly Options (BAM Layout)
Assembly Options (De Novo, Special Templated)
Assembly Options (All Others)
Advanced Assembly Options
Advanced Options (Normal Templated, Reference-Guided)
Advanced Assembly Options (De Novo)
Advanced Options (BAM Layout)
SNP Options Dialog
The Your assembly is ready to begin Dialog
The Assembly Log
The Project Report Dialog
The Assembly Report
Output Files for Different Workflows
XNG Workflow Output
SNG Workflow Output
How To
View Assembly Results in SeqMan Pro
Create a SeqMan NGen Assembly to Use with ArrayStar
Create an Assembly for Validation Control Accuracy Testing
Export ArrayStar Sequences to SeqMan NGen
Manually Specify an Isoform
Make a Custom VCF File
Make a Custom BED File
Control Automatic Software Updates
Frequently Asked Questions
Why doesn't SeqMan NGen run in the command line?
Why isn't SeqMan NGen on Ubuntu's installed software list?
Why is the Export Aligned value higher than expected?
Why is the MID column missing from the SeqMan Pro SNP Report?
What file extensions are used for unassembled sequences?
Why can't I add a downloaded genome package as my template?
Why do assembly statistics vary from version 3.0 to 3.1?
Appendix
Supported File Types
Manifest File Formats
Repeat Handling
Detection of Structural Variations
Equivalence Between Wizard Settings and SNG Scripting Commands
Complete List of Parameters by Read Technology
Normal Templated Assembly - Metagenomics
Normal Templated Assembly - All Others
Special Templated Assembly - Genome
De Novo Assembly - Transcriptome
De Novo Assembly - Genome Assembly
De Novo Assembly - Metagenomics
SeqMan NGen Scripting Manual
SeqMan NGen Assemblers
Scripting Manual Conventions
XNG Commands
SNG Commands
Research References
Index

SeqMan NGen Overview

Note: For Customer Support contact information, and for a link to the most up-to-date version of this help, please see Before You Begin.

DNASTAR's SeqMan NGen software gives you the ability to assemble and analyze large genomes with unsurpassed speed using sequence data from any major next-gen sequencing platform. Features of the software allow you to:

• Assemble human or other large eukaryotic genomes against a genomic template quickly and easily on a desktop computer.
• Assemble nearly any type of next-gen data, including Ion Torrent, Illumina, Pacific Biosciences, Roche 454, and SOLiD.
• Perform reference-guided assemblies of billions of sequence reads and de novo assemblies of up to 30 million sequence reads (genome sizes up to 50 megabases).
  Note: The upper limit for project size depends on many factors, including your computer's amount of RAM and its processor speed. See our recommended technical requirements for more information.
• Assemble multi-sample data, such as MID-tagged 454 samples, for later analysis in SeqMan Pro.
• Use an interactive, data-rich SNP report, including probabilities and genotypes determined with Bayesian statistics as well as dbSNP, GERP and COSMIC associations.
• Detect potential structural variation regions (displayed as a table in SeqMan Pro).
• Align reads against a database of genomic templates.
• Recalculate SNPs for a BAM-based assembly project.

If you plan to create a templated assembly and wish to use the dbSNP, GERP or COSMIC association features, you'll first need to download a free DNASTAR genome template package that contains the template sequence, annotations, and associated dbSNP entries. Packages are available for a variety of model organisms.

SeqMan NGen uses scripts to run an assembly. The SeqMan NGen wizard generates these scripts for you, with no programming required.

For video tutorials on using the SeqMan NGen wizard, or access to a command line scripting manual, please click here and choose the Resources tab on the left.

Wizard Navigation

Navigate through the SeqMan NGen wizard using the buttons at the bottom of each dialog.

• Click the Help button (Win) or the question mark icon (Mac) to launch the user's guide topic for the current panel.
  Note for Linux Users: The wizard Help button is not active on the Linux platform. Instead, wizard help is provided in this PDF. Scripting command help is available via the SeqManNGen_ScriptingManual.txt file that was included with your SeqMan NGen software. For further assistance, please visit the Training and Support section of our website or contact us at support@dnastar.com.
• Click < Back and Next > to navigate to the previous or next panel.
• Click Quit to exit SeqMan NGen. If you have not yet saved a script or run an assembly, a confirmation prompt will appear. Choose Quit to exit without saving changes, or Cancel to return to the wizard.

Non-English Keyboards

SeqMan NGen recognizes only standard English-keyboard characters as input. If you are using a non-English keyboard, we recommend that you switch to a virtual English keyboard. Click a link for instructions: Windows 8, Windows 7, Macintosh OS X 10.8, Macintosh OS X 10.9, Linux.

Before You Begin

We're here to help! If you have any difficulties with or questions about this application, please contact a DNASTAR support representative:

support@dnastar.com
Phone (Madison, WI, USA):
In the USA and Canada, call toll free:
In the UK, call free on:
In Germany, call free on:

This help document pertains to SeqMan NGen version 12, and was last updated on June 12.

If you accessed this help from within a Lasergene application: The help installed with your application was current at the time of the version release. To view the most recent help online, please visit our Training & Support page and click the appropriate application link.

To access free video tutorials for this product, please visit our website and use the tabs at the top to choose Support > Videos.

For copyright and trademark information, please see the Legal Information page of our website.

The Welcome Screen

The Welcome screen is the initial SeqMan NGen wizard screen for all workflows.


Choose from the following options:

• Select Create new assembly project to create a new project step-by-step using the wizard. Then click Next to proceed to the Choose Project Type screen.
• Choose Load existing script to load parameters from a past project into the wizard; this is similar to the File > Open command in other DNASTAR applications. Then click Next to open a file browser. After loading the script, you will be taken automatically to the Choose Project Type screen.
• Select Run existing script to load a past project and proceed to the end of the wizard without stepping through the intervening screens. Then click Next to open a file browser. After loading the script, you will be taken automatically to the Your assembly is ready to begin dialog.
• Choose Import BAM file to go directly to the BAM Import screen, from which you can import the desired BAM file (*.bam or *.assembly). This choice also leads to the Recalculate SNPs workflow.
• Click the About SeqMan NGen link to view basic information about this application, such as its version number. You must click OK to close the information window before continuing with the wizard.

Choose Project Type

The Choose Project Type dialog lets you select from several workflow types. SeqMan NGen will then populate the rest of the wizard with appropriate default parameters for your project.

Your selection in this screen determines which assembly types are enabled in the next screen, entitled Choose Assembly Type.

[Table: one row per Choose Project Type selection (Genome assembly; Exome assembly; Mendelian / germline gene panel assembly; Cancer / somatic gene panel assembly; Transcriptome / ChIP-seq assembly; miRNA assembly; Metagenomics/16S rRNA assembly; Viral-Host Integration) and one column per Choose Assembly Type option (Normal Templated; Templated with Host Removal; Templated with Control; De Novo; De Novo with Host Removal; Reference-guided; Special Templated), with an x marking each enabled combination. For Viral-Host Integration, the Choose Assembly Type screen is not present in the workflow; host removal is performed automatically.]

Once you are finished, click Next > to continue to the next wizard screen.

Choose Assembly Type

The Choose Assembly Type dialog allows you to choose between several types of templated and/or de novo assemblies. The text below each selection can help you decide whether a particular assembly type suits your data set.


The options available in this dialog depend upon what you selected in the previous screen. See the table in Choose Project Type for details. The screen has one version for the Genome Assembly workflow and another for the Metagenomics/16S rRNA assembly workflow.

• Templated assembly - normal workflows (called Templated assembly in the Metagenomics/16S rRNA Assembly workflow): To assemble/align reads onto one or more reference sequences/templates. This type of assembly can include billions of reads and large eukaryotic genomes. The BAM-formatted assembly cannot be edited, but can be viewed and analyzed using a utility such as DNASTAR's SeqMan Pro.
• Templated assemblies with control: To assemble one or more related assemblies with a designated control set.

• Templated assembly with host removal: To remove the DNA of a specified host before assembling/aligning the remaining reads onto one or more reference sequences/templates. This option is only available in the Metagenomics/16S rRNA assembly workflow.
• De novo assembly: To run a de novo (untemplated) assembly of up to 30 million sequence reads and up to a 50 Mbase genome. When assembling a data set de novo, we recommend using paired end data if available.
  Note: The De novo assembly option does not appear if you selected Targeted resequencing / Exome assembly in the Choose Project Type wizard dialog.
• De novo assembly with host removal: To remove the DNA of a specified host before running a de novo (untemplated) assembly. This option is only available in the Metagenomics/16S rRNA assembly workflow.
• Reference-guided assembly with gap closure: To assemble/align reads onto one or more reference sequences/templates. This option automates the assembly of indels using mate-pair data, and can include up to 10 million reads and up to a 100 Mbase genome. The SQD-formatted assembly can be edited at a later time using SeqMan Pro. For more information about this assembly type, see Reference-Guided Assembly with Gap Closure.
• Templated assembly - special workflows: To assemble/align reads onto one or more reference sequences/templates. This type of assembly can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Pro.
  Note: The Templated assembly - special workflows selection was eliminated in SeqMan NGen 4.1, but was later reintroduced as a heritage/legacy workflow for use only with the Genome assembly project type. We encourage you to use the standard templated or the reference-guided workflows whenever possible.

Once you are finished, click Next > to continue to the next wizard screen. SeqMan NGen will populate the rest of the wizard with appropriate default parameters for your assembly.
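The read-count and genome-size ceilings quoted in the assembly type descriptions above can be collected into a small lookup. The Python sketch below is only an illustration and is not part of SeqMan NGen: the dictionary keys, the function name `exceeded_limits`, and the reading of 1 Mbase as 1,000,000 bases are assumptions. As noted in the Overview, real limits also depend on RAM and processor speed.

```python
# Ceilings quoted in the Choose Assembly Type descriptions.
# None means the guide states no fixed ceiling for that assembly type.
# The key names are hypothetical labels chosen for this sketch.
LIMITS = {
    "de_novo": {"max_reads": 30_000_000, "max_genome_bp": 50_000_000},
    "reference_guided_gap_closure": {"max_reads": 10_000_000, "max_genome_bp": 100_000_000},
    "templated_special": {"max_reads": 10_000_000, "max_genome_bp": 100_000_000},
    "templated_normal": {"max_reads": None, "max_genome_bp": None},
}


def exceeded_limits(assembly_type, n_reads, genome_bp):
    """Return the names of the quoted ceilings that a data set exceeds."""
    limits = LIMITS[assembly_type]
    problems = []
    if limits["max_reads"] is not None and n_reads > limits["max_reads"]:
        problems.append("max_reads")
    if limits["max_genome_bp"] is not None and genome_bp > limits["max_genome_bp"]:
        problems.append("max_genome_bp")
    return problems
```

For example, `exceeded_limits("de_novo", 40_000_000, 10_000_000)` reports that only the read ceiling is exceeded, which suggests either subsampling the reads or choosing a templated workflow.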
Reference-Guided Assembly with Gap Closure

Reference-guided assembly with gap closure, one of the selections in the Choose Assembly Type screen, is a semi-automated assembly option that uses both reference-guided ("templated") and de novo assembly steps to resolve three types of structural variation (SV), namely insertions, deletions and replacements (indels), with minimal user intervention.

The following conditions apply if you wish to follow this workflow:

• In the Choose Project Type screen, you should choose Genome Assembly.
• In the Choose Assembly Type screen, you should choose Reference-guided assembly with gap closure.

• Your data should be from a haploid genome with at least one mate pair data set with read lengths of 100 bases or greater.
• Your total number of reads should be 10 million or fewer. If you use a larger data set, only the first 10 million reads will be used. For mate pair data, equal numbers of matching forward and reverse reads are processed.

Steps 1-3: Assembly in SeqMan NGen

During assembly, data is processed in several stages (see figures below):

Step 1) Data is mapped and aligned to a user-defined reference genome and then analyzed for characteristic SV motifs.

Step 2) The reference sequence is split at the detected SV sites, forming a series of ordered contigs.

Step 3a) Mate pair and split reads from each SV event are collected in site-specific pools and assembled de novo. Deletions are detected using three types of data: split reads, spanning paired-end reads, and sequence coverage information. For insertions and replacements, mate pair reads corresponding to the new sequence are collected from the unassembled read pool. Only reads anchored by mates flanking the SV in the main assembly are used at this stage.

Step 3b) The de novo assembled contigs are then brought into the main assembly and positioned consistently with the mate pair information.

Step 3c) For SVs where the gap is not completely covered by the de novo assembled contigs (e.g. insertions longer than twice the size of the insert library), additional reads from the unassembled read pool that match and extend the ends of the joining contigs are added in an attempt to walk across the gap. This walk is terminated when either no new reads are found or a repeated element is encountered.

At the end of assembly in SeqMan NGen, two types of output files are produced. These allow the project to be evaluated and further processed in SeqMan Pro:

• An *.assembly package with a non-editable BAM-formatted alignment file of the initial reference-guided assembly without further processing.
• The fully processed assembly in an editable SeqMan *.sqd file format.
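Step 2 above, splitting the reference at the detected SV sites into ordered contigs, can be pictured with a short Python sketch. This is a simplified illustration and not SeqMan NGen's implementation: the function name `split_reference` is hypothetical, and each SV site is assumed to be a single 0-based cut position, whereas a real deletion or replacement has two breakpoints.

```python
def split_reference(reference, sv_sites):
    """Split a reference string at 0-based cut positions, keeping contig order.

    Returns a list of (start, sequence) tuples, one per contig, where start
    is the position of the contig's first base in the original reference.
    """
    # Drop duplicates and positions that would produce an empty contig.
    cuts = sorted({p for p in sv_sites if 0 < p < len(reference)})
    bounds = [0] + cuts + [len(reference)]
    return [(s, reference[s:e]) for s, e in zip(bounds, bounds[1:])]
```

Keeping the start coordinate with each piece mirrors the idea that the contigs stay ordered along the reference, so that de novo contigs assembled in Step 3 can later be placed between neighbours.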

Step 4: Further Processing in SeqMan Pro

Step 4) The editable *.sqd document, containing the fully processed assembly, is used for gap closure to complete the new sequence. In SeqMan Pro's Project Summary window, the contigs will appear in a single ordered scaffold with the de novo generated contigs. The values in the contig position column of the window roughly correspond to the 5' position of the first base of that contig in the reference genome, although the position is generally shifted 20 bp downstream to accommodate positions for the gap-filling contigs.

The ordered contigs can be merged using SeqMan Pro's Contig > Align Contigs End-to-End option. This option ensures that only adjacent contigs are considered for merging, which guards against false joins caused by repeat elements. With sufficient depth of coverage, this step should close a significant number of gaps. However, some gaps may remain. These may be caused, for example, by long insertions that could not be reliably walked across in this automated fashion. The remaining gaps that require more manual intervention can be closed using the suite of tools in SeqMan NGen and SeqMan Pro.

Evaluation in SeqMan Pro

The *.assembly package allows you to inspect the initial reference-guided assembly and detected SV events via SeqMan Pro's Structural Variation table. Single nucleotide polymorphisms (SNPs) and small insertions and deletions can also be inspected using the SNP table.

The following images show Steps 1-4 for deletions and three types of insertions. The terms "split reads" and "junction reads" are used in some of these images and are defined as:

Split reads: Reads in which the first portion matches one location in the genome, and the adjacent portion matches a downstream location on the same strand. The endpoint of the first segment and the start point of the second segment generally define the breakpoints of the deletion, although the exact positions may vary by a few bases in some cases. The presence of multiple split reads at a given position is required to avoid spurious splits caused by, for example, micro-repeats in the genome.

Junction reads: Mate pair reads where one read aligns either upstream or downstream of the structural variant, and its mate aligns either on the other side ("spanning pairs") or within the new sequence (in the case of insertions and indels). In the latter case, reads within the inserted sequence are identified from the unassembled read pool by virtue of their mates in the assembly.
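The requirement that several split reads agree on a position before a deletion is accepted can be sketched in Python as follows. This is a hedged illustration only: the function name `supported_deletions`, the `(left_end, right_start)` input representation, and the default threshold of 2 are assumptions, not SeqMan NGen parameters. The sketch also demands exact agreement between reads, whereas the definition above notes that real breakpoints may vary by a few bases.

```python
from collections import Counter


def supported_deletions(split_reads, min_support=2):
    """Return deletion breakpoint pairs backed by enough split reads.

    split_reads: iterable of (left_end, right_start) tuples, one per split
    read, giving where the first segment ends and where the second segment
    starts on the reference. Pairs whose second segment does not lie
    downstream of the first are ignored.
    """
    counts = Counter(
        (left, right) for left, right in split_reads if right > left
    )
    # Keep only breakpoints seen at least min_support times.
    return sorted(bp for bp, n in counts.items() if n >= min_support)
```

Raising `min_support` trades sensitivity for fewer spurious calls, which is the same trade-off the definition describes for micro-repeats.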


Metagenomics/16S rRNA Workflows

The Metagenomics/16S rRNA workflow, one of the selections in the Choose Project Type screen, offers both reference-guided ("templated") and de novo assembly options, with and without removal of host DNA. The default parameters for this workflow have been optimized to take into account the short read lengths and presence of repetitive DNA sequences common to metagenomic and 16S rRNA data.

To follow this workflow:

• In the Choose Project Type screen, select Metagenomics/16S rRNA assembly. When you make this selection, SeqMan NGen automatically pre-filters Metagenomics/16S data prior to assembly by removing redundant, low-quality sequences.
• In the Choose Assembly Type screen, select from any of the four options: templated assembly, templated assembly with host removal, de novo assembly, or de novo assembly with host removal.
• If you select one of the two options involving host removal, the workflow will include the Input Host Files screen. Many genome packages are available for free download from the DNASTAR website. SeqMan NGen will remove host DNA first, then assemble the remaining data using the method you specified (templated or de novo).
• If you select a templated option, the workflow will include the Input Biome Genomes screen. Reference sequences can be downloaded from any 16S rRNA database, such as Silva, Greengenes or the Ribosomal Database Project (RDP).
• In the Input Sequences screen, you may enter either single-end or paired-end reads. If available, paired-end reads are recommended for highest accuracy.

Viral-Host Integration Workflows

The Viral-Host Integration workflow, chosen via the Choose Project Type screen, is a special type of assembly used to locate putative viral insertion sites.

To follow this workflow:

• In the Choose Project Type screen, select Viral-Host Integration. When you make this selection, SeqMan NGen automatically sets up a templated assembly that is optimized for locating viral insertion sites.
• In the Input Host Files screen, input one or more host files. Many genome packages are available for free download from the DNASTAR website.
• In the Input Viral Genomes screen, add the viral genome(s) of interest.

• In the Input Sequence Files screen, input your sequencing reads. These should consist of the virus-infected host DNA for which you wish to determine likely viral insertion sites.

Since chimeric reads (sequences consisting of both host and viral DNA) usually indicate viral insertion sites, SeqMan NGen looks for chimeric reads in a three-step process:

1) The viral genome is used as the initial assembly template.
2) The subset of reads that mapped to the viral genome is then re-assembled against the host template.
3) The host template assembly results are output in BAM file format.

To explore possible viral insertion sites, launch SeqMan Pro and view the Coverage Reports for the individual contigs (Contig > Coverage Report). During both templated assembly steps, SeqMan NGen "masks" (trims) whichever half of the chimeric read does not match the template for that step. Use the Coverage Report to navigate to positions with multiple reads, as evidenced in the depth column. The reads at these positions should be trimmed to the same base, which indicates the insertion site. You may "untrim" the reads to verify that they also contain viral sequence.

BAM Import

The BAM Import dialog allows you to choose whether you wish to align against a BAM layout file or recalculate SNPs for an existing alignment. After you make a selection, SeqMan NGen will populate the rest of the wizard with appropriate default parameters for your assembly.

• Align BAM layout file: To assemble reads against a BAM-format template. This also gaps a BAM file that is not already gapped. SNPs are calculated automatically as part of the alignment process.
• Recalculate SNPs: To calculate SNPs for a finished assembly.

Once you are finished, click Next > to continue to the next wizard screen.

Recalculate SNPs

If you are recalculating SNPs for a BAM assembly, you must choose a project type and location from the Recalculate SNPs dialog.

Project type: Choose SeqMan NGen Assembly Package if you wish to recalculate SNPs on an existing assembly package. SeqMan NGen will recalculate SNPs into the existing package. Otherwise, choose BAM Project and use the Browse button to designate a BAM file.

Project: If you selected SeqMan NGen Assembly Package as the project type, you must use the Browse button to select the existing assembly. If you selected BAM Project in the drop-down menu, you must use the Browse button to select the BAM file.
Note: The BAM file must be fully gapped.

Genome ploidy: Select the type of ploidy for your project. Choosing Haploid or Diploid establishes the statistical model SeqMan NGen will use in estimating probabilities during SNP calls. Selecting Population / other (e.g. for a polyploid genome or a population study) causes SeqMan NGen not to calculate probabilities.

If desired, click the SNP Options button to open the SNP Options dialog. This dialog allows you to view and edit options for recalculating SNPs.

Once you are finished, click Next > to continue to the next wizard screen.

Set Up Project Files

You must select a name and location for your project in the Set Up Project Files dialog before proceeding further in the wizard. There are two versions of this dialog depending upon your choices in previous wizard screens. See the links below for detailed information.

Set Up Project Files (De Novo, Special Templated)

You must select a name and location for your project in the Set Up Project Files dialog before proceeding further in the wizard. The following version of the dialog is shown only when you are following the de novo or special templated workflows.

Note: For other workflows, see Set Up Project Files (All Others).


There are two mandatory fields in this dialog:

Project name: Enter a name for all output files, including the finished assembly. By default, alignment files are saved in SeqMan Pro (*.sqd) format.

Project folder: Use the Browse button to select a location for your assembly output files. Required disk space may range from 1 GB to 5 TB, depending on a variety of factors. See our technical requirements page for more information.

Note: Never save the assembly output files directly to the desktop, as the many intermediate files and folders created during assembly may hamper or prevent further computer operations. However, files may be saved to a folder on the desktop.

The following checkboxes let you request additional output files. These will all have the name and location specified above, but different file extensions:

Save project as: Check one or more boxes to save assembly output files in these formats:

Format Type             Extension   Editable?     Viewable in SeqMan Pro?   Read Limit
SeqMan Pro format (1)   *.sqd       Yes           Yes                       10 million
ACE format              *.ace       Yes           Yes                       10 million
BAM format              *.bam       No            Yes                       None
SAM format (2)          *.sam       Export only   No                        None

(1) If your assembly exceeds the read limit size for SeqMan Pro format, it will automatically be saved in BAM format instead.
(2) The SAM format option is grayed out (unavailable) if you selected De novo assembly in the Choose Assembly Type dialog.

Save unassembled reads: Check the box to save all sequences that were not assembled in the project as a multi-sequence *.fastq file. A default quality score of 15 will be given to each base.

Save contigs to fasta: Check the box to save the consensus sequences from each contig in the assembly as a multi-sequence *.fasta file.

Save Report: Check the box to save an assembly report text file.

Note for all users: Choosing SeqMan Pro format in Save project as causes all report information to be saved within the SeqMan Project file (*.sqd), even if you do not check the box to save the report separately. To view the report in SeqMan Pro, choose Project > Report.

Note for Windows users: To open a text report with the correct formatting displayed, we recommend using Wordpad, Notepad++, or Microsoft Excel, and not the default Windows text editor, Notepad.

Once you are finished, click Next > to continue to the next wizard screen. If you choose a name that already exists in the chosen location, you will receive a warning.


In the warning dialog, click OK to continue and overwrite the earlier project, or Cancel to return to the wizard screen, where you may change the project name and/or location.

Set Up Project Files (All Others)

You must select a name and location for your project in the Set Up Project Files dialog before proceeding further in the wizard. The following version of the dialog is shown for all workflows other than the de novo or special templated workflows.

Note: For de novo or special templated workflows, see Set Up Project Files (De Novo, Special Templated).

Project name: Enter a name for all output files, including the finished assembly. The finished assembly will be saved in BAM format.

Project folder: Use the Browse button to select a location for your assembly output files. Click the link below for information about disk space requirements.

Temporary file location: Use the Browse button to designate a location for the intermediate files produced during assembly. We recommend using an external hard drive as the temporary file location. SeqMan NGen will remember and use the temporary file location for future assemblies.

Note 1: Never save the assembly output files or temporary files directly to the desktop, as the many intermediate files and folders created during assembly may hamper or prevent further computer operations. However, files may be saved to a folder on the desktop.

Note 2: By default, most temporary files are deleted when the assembly is complete. Other files (e.g., [template_name].fasinfo.sqlite and [template_name].mer) may remain in the temporary file location in order to facilitate efficient reassembly of data in the future.

Use the link Click here for technical requirements to open a DNASTAR web page describing technical requirements for reference-guided and de novo assemblies.

Once you are finished, click Next > to continue to the next wizard screen.

Input Template/Host Files

If you are doing a Viral-Host Integration or a Metagenomics/16S rRNA workflow, this screen is called "Input Host Files." For other templated assembly workflows, it is named "Input Template Files." You must enter one or more template (reference) or host sequences here before proceeding further in the wizard.

Note: If you wish to manually specify an isoform to use in SNP calling, you will need to perform a minor edit to the template sequence before adding it here. See Manually Specify an Isoform for more information.

Depending upon your workflow, there will be between three and five buttons on the right of the dialog.

Add: Click to navigate to and select one or more individual sequence(s) to use as template or host sequences.

Note: Before adding a reference file to certain projects, you may wish to first annotate it in SeqBuilder for known SNPs/variations and other features.

SeqMan NGen supports the following file formats for reference and host sequences.

29 Format Extension Notes DNASTAR SEQ files GenBank files FASTA General Feature Format sequence files Genome template packages *.seq *.gbk *.fas, *.fna, *.fasta, *.txt *.gff *.genometemplatepackage Feature-containing files can be used directly as input for both the sequence and feature annotation. GenBank flat files be used directly as input for both the sequence and feature annotation. Can be used with or without an associated *.gff annotation file. Sequence-containing GFF files can be used directly as input for both the sequence and feature annotation. This option is available for templated assemblies only. See Note (at beginning of this topic) and Add Genome Package (below) for details on downloading and working with genome packages. Add Folder Click to navigate to and select an entire folder of sequences to use as template or host sequences. All sequences within the specified folder will be added. After adding files using the Add or Add Folder buttons, the Add Genome Package button will be grayed out. Add Genome Package Click this button to browse to an extracted genome package on your hard drive. If you are following a templated workflow and wish to use DNASTAR's database association features, you must input one of the DNASTAR genome packages at this step. Note: If you haven t yet downloaded and/or extracted a genome package, learn how to do so in Downloading and Extracting a Genome Package. Your downloaded genome package will be utilized by SeqMan NGen, but the actual genome files will remain in their original locations on your hard drive. After adding files using the Add or Add Folder buttons, the Add Genome Package button will be grayed out. Remove Click to remove a selected (highlighted) file from the list. Add Features (This option is not available in the viral-host integration or Metagenomics/16S rrna workflows.) Click to add separate *.gff feature files. Feature files are not displayed in the reference sequence window. 

Include alternative assembly templates check box - If you added a genome package as your template, this new option will appear near the bottom of the dialog. Check the box to include alternate sequence (alt loci) representation for variant regions. See the Variation section of this Genome Reference Consortium announcement for details. Alternate sequences include those known to be in a particular chromosome, but whose exact position is unknown; and sequences with known positions, but otherwise incomplete entries.

VCF file check box (not visible in all workflows) - Certain SeqMan NGen workflows allow you to import a custom VCF SNP file (e.g., created in SeqMan Pro or ArrayStar) with data from one or more assemblies. To add a VCF file, check the box next to VCF file and then use the corresponding Browse button to navigate to the file. If you elect to do this, positions within the VCF file will be given a VCF SNP ID during the assembly process. After assembly, information about each position can be viewed in the SeqMan Pro SNP Report. (Within SeqMan Pro, choose SNP > SNP Report.)

Note 1: SeqMan NGen only supports one VCF file per assembly project. If you have multiple VCF files (e.g., one per chromosome), you will need to merge the information into a single VCF file before browsing to the file. For more information, see Make a Custom VCF File.

Note 2: If you selected Templated assemblies with control from the Choose Assembly Type screen, the VCF import option is not visible in this screen. Instead, VCF files are imported from the Set Up Experiments screen.

Targeted Regions file check box - If you chose Exome assembly, Mendelian/germline gene panel assembly, or Cancer/somatic gene panel assembly from the Choose Project Type screen, you cannot proceed to the next screen until you have specified a file containing targeted region information. Click the Browse button to the right of the Targeted Regions file check box (only visible in the workflows listed) and navigate to a BED or manifest file. BED files must have the extension *.bed. Manifest files can have various extensions, but must be in the correct format.

Once you are finished, click Next > to continue to the next wizard screen.

Downloading and Extracting a Genome Package

Genome template packages can be downloaded and added via the Input Template/Host Files wizard screen.

To download a genome package:

Click the Download genome template packages link at the bottom of the wizard screen. This opens a DNASTAR web page from which you can download packages for a variety of genomes for free. Each package contains the template sequence, annotations, and associated dbSNP linking information. Human genome packages also contain GERP scores and COSMIC linking information.

Downloaded genome packages are saved on your computer as ZIP files, and must be extracted prior to use.

To extract a downloaded genome package:

On Macintosh: Double-click on the ZIP file. The files will be automatically extracted via the Archive Utility.

On Windows 7 & Windows 8: Double-click on the ZIP file. In the ensuing Explorer window, click Extract all files from the top left. Choose a location for the files and select Extract.

See Input Template/Host Files for instructions on adding the genome package to SeqMan NGen.

Note: SeqMan NGen can read and produce output using a variety of common chromosome naming conventions, including chr1 and ch1, as well as Arabic and Roman numerals. Chromosome names are captured from genome template packages and used to assign contig IDs to entries from BED, VCF and manifest files.

Annotating Template Sequence Prior to Assembly

Using annotated template sequences in SeqMan NGen may enable you to better analyze the identified putative SNPs when viewing your assembled project in SeqMan Pro. If desired, annotate your template sequence in SeqBuilder (the Lasergene application for sequence editing and visualization) prior to adding it to the Input Template/Host Files dialog.

1) Launch SeqBuilder.

2) Go to File > Open and select your template sequence.

3) Select the range of sequence where a feature will be added. (Use Edit > Go to Position to navigate quickly up and down your sequence.)

4) Go to Features > New Feature. A new misc_feature will be added to your sequence and displayed in the Feature List.


5) Click on misc_feature from within the Feature List and select the appropriate feature type from the list provided. For example:

For SNPs, choose Variation > variation.
For exons, choose Gene > exon.
For CDS features, choose Transcript > CDS.
For origin of replication, choose Structure > rep_origin.

Note: The next feature you create will automatically be of the same feature type you just selected, enabling you to create all the features of one type more quickly.

6) Repeat steps 3-5 until all of your features have been added. Then go to File > Save As and save your sequence in *.sbd, *.seq or *.gbk format.

Your annotated template sequence is now ready for assembly in SeqMan NGen and subsequent analysis in SeqMan Pro.

Input Viral/Biome Genomes

In viral-host integration workflows, this screen is named "Input Viral Genomes." In Metagenomics/16S rRNA workflows, it is called "Input Biome Genomes." Other than the name, the wizard screens are identical.

You must enter one or more reference sequences before proceeding further in the wizard.

Add - Click to navigate to and select one or more individual genomes. SeqMan NGen supports the following file formats: *.seq, *.gbk, *.fas, *.fna, *.fasta, *.txt, *.gff and *.genometemplatepackage.

Note: If you are following the Metagenomics/16S rRNA workflow, we recommend inputting a biome genome that is in FASTA, rather than GenBank, format. If you use a GenBank file, SeqMan NGen may run out of memory parsing all the features in all the templates. If your biome genome is currently in GenBank format, use the Convert File Type template in DNASTAR's SeqNinja utility to convert it to FASTA format before importing it into SeqMan NGen.

Add Folder - Click to navigate to and select an entire folder of genome sequences. All sequences within the specified folder will be added.

Note: Once you have added files using the Add or Add Folder buttons, the Add Genome Package button will be grayed out.

Add Genome Package - Click this button to browse to the extracted genome package on your hard drive (see Downloading and Extracting a Genome Package). Your downloaded genome package will be utilized by SeqMan NGen, but the actual genome files will remain in their original locations on your hard drive.

Note: Once you have added an extracted genome package using the Add Genome Package button, the Add and Add Folder buttons will be grayed out.

Remove - Click to remove a selected (highlighted) file from the list.

Once you are finished, click Next > to continue to the next wizard screen.

Input Sequence Files

You must choose a sequencing technology and enter one or more read files in the Input Sequence Files dialog before continuing with the wizard.


1) You must select a Read technology from the drop-down menu before proceeding to the next screen. If you do not make a selection here before clicking Next, you will receive a reminder to do so.

Choose from Illumina < 50 nt, Illumina > 50 nt, 454, Ion Torrent, Pac Bio, Sanger (not available in all workflows) or Other. The default values for parameters and other assembly options in subsequent panels will be based on your selection.

The following notes refer to specific workflows or data types:

If you are doing a reference-guided assembly with gap closure, you must select either Illumina > 50 nt, 454 or Ion Torrent.

For de novo assemblies, if you select Illumina > 50 nt and enter an insert size of 150 bp or less in the Set Pair Information dialog, the assembler will assume the reads overlap and will attempt to create a single super-read from each pair. Read pairs that cannot be merged, either because they do not overlap or because they have numerous errors in the overlapping region, will not be included in the assembly.

Sanger is only included in the menu if you are doing a de novo assembly.

Both types of Ion Torrent paired reads ("mate pairs" and "paired ends") are supported.

2) If you are using multi-sample data, check the box to the left of Multi-sample data, then click Edit MID Tags to open a related customization dialog: Edit Group Names for Illumina, PacBio and Ion Torrent data; Edit MID Tags for 454 data. (Note that Multi-sample data and Edit MID Tags are disabled for certain project types.) After you enter customization information, SeqMan NGen will produce combined assemblies with data organized and analyzed on a per-sample basis.

Note: In assembling multi-sample data, SeqMan NGen considers all samples together. This can affect the final gapped alignment and therefore potentially yield slightly different results than assembling each sample individually.

3) If you want to separate the multi-sample data and run them as separate projects, check the Run as separate projects box. The same reference sequence(s) and parameters will be used for all the projects. This option is disabled for 454 read technology and for certain project types.

4) Set up unpaired or paired reads using the buttons on the right of the screen. Assemblies can include both single and paired end read files. Single end files should be added to the top pane of the window, while paired end files should be added to the bottom pane. In both cases, files may be added individually or in folders.

Add - Click to navigate to and select one or more individual sequences for your assembly project. See the SeqMan NGen Supported File Types page for supported file formats.

Add Folder - Click to navigate to and select an entire folder of sequences. All sequences within the specified folder will be added.

Remove - Click to remove a selected (highlighted) file from the list.

If you are doing a reference-guided assembly with gap closure, you must enter at least one set of paired end read files before you can proceed to the next wizard screen.

If paired end data are added, the Set Pair Information dialog pops up automatically. See Set Pair Information (Certain Sanger Data) or Set Pair Information (All Others) for more information. Once you have entered pair information, the insert size that you input will appear in the "Insert Size (bp)" column.


If you chose Templated assemblies with control from the Choose Assembly Type screen, adding either unpaired or paired reads will cause a new column, Experiment, to appear in the sequence file area. Each of the Experiment cells initially contains the text ENTER NAME. Double-click in each cell and type in a name for that experiment. Data files with the same Experiment name will be assembled together. You will not be allowed to proceed to the next wizard screen until you have entered the experiment information.

Once you are finished, click Next > to continue to the next wizard screen.

Edit Group Names

The Edit Group Names dialog opens when you specify a multi-sample data set in the Input Sequence Files dialog, and then click the Customize Sample Names button. Use this dialog to define custom names for sorting and displaying individual sample sets in SeqMan Pro.

Note: If you chose 454 as the Read Technology in the Input Sequence Files dialog, you will instead see the Edit MID Tags dialog.

The Read File column contains individual read file names, while the Group Name column contains your custom group names. Ion Torrent, Illumina and PacBio data all require you to enter a Group Name for each sequence file before leaving the dialog. Read files will be separated into individual sample files based on these custom names.


To add a custom group name, type it directly in the Group Name column. Alternatively, you can select the row to be edited and click Customize Name. Type the desired name and click OK. Within a particular project, all the tags must be the same length.

Edit MID Tags

The Edit MID Tags dialog is accessed when you specify a multi-sample 454 data set in the Input Sequence Files dialog, and then click the Customize Sample Names button. This dialog allows you to define custom names for sorting and displaying individual sample sets in SeqMan Pro.

Note: If you chose anything other than 454 as the Read Technology in the Input Sequence Files dialog, you will instead see the Edit Group Names dialog.

MID, Indexing, and Barcode tags are synonyms for short stretches (5-10 bases) of unique DNA sequence that are added to samples, allowing the samples to be amplified and sequenced as a



pool. After sequencing, the data for individual samples are separated by identifying the tag sequence and sorting the reads into different bins based on their tags. Prior to assembly, SeqMan NGen splits all the reads in the specified file(s) into individual fastq files based on the information in this table.

The three columns contain the following information:

Tag ID - contains the 151 Titanium MID identifiers (Roche Technical Bulletin 005-2009).
Sequence - the sequence of the tag.
Custom Name - displays any custom sample names.

To add a custom name for a MID group, select the row to be edited and click Customize Name. Type the desired name and click OK. Note that within a particular project, all the tags must be the same length.

To add a non-MID tag to the bottom of the list, click Add New Tag. Type in the information for all three columns of the table, then click OK.

To remove a row from the table, select the row you wish to remove, then press Delete.

To restore the default MID tag table, click Restore Factory Default. This button always restores the factory defaults, not the user default table.

To save a modified table as the default, click Save as User Default. All information in the table, including any custom names, will be saved and used as the initial default table the next time you set up a multi-sample project with the same read technology. Choosing this button will overwrite any previously saved default table.

When you are done, click OK. During assembly, read files will be separated into individual sample files based on tag information and your own custom names, if provided.

Set Pair Information (Certain Sanger Data)

In the Input Sequence Files dialog, the following version of the Set Pair Information dialog pops up automatically when you fulfill these criteria:

Any workflow except normal templated + Sanger read technology + paired data

Note: If you are doing a normal templated workflow and/or using non-Sanger paired data, a different version of the dialog appears. See Set Pair Information (All Others) for details.

Insert size - Enter the anticipated distance between paired end reads across the library. SeqMan NGen will use this value to automatically calculate the minimum and maximum insert distances. The default value is 3000 bp.

Note: During assembly, SeqMan NGen lists this value as a range. If, for example, you enter an Insert size of 300, the Assembly Log will list the value as 0 to 450. This convention does not impact assembly results.
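The range convention in the note above can be sketched in Python as follows. This is an illustration, not SeqMan NGen's actual calculation. The minimum follows the rules this guide gives under Set Pair Information (All Others): 0 for inserts under 1000 bases and for Illumina data, otherwise half the insert size. Whether those rules also apply to this Sanger dialog is not stated. The maximum of 1.5 times the insert size is an assumption inferred from the single "300 becomes 0 to 450" example, and the function name is hypothetical.

```python
def insert_range(insert_size, technology="Other"):
    """Return an illustrative (minimum, maximum) insert distance in bp."""
    if insert_size <= 0:
        raise ValueError("insert_size must be a positive number of bases")
    # Documented rule: short inserts (< 1000 bases) and Illumina data use 0.
    if technology == "Illumina" or insert_size < 1000:
        minimum = 0
    else:
        # Documented rule: larger inserts use half of the insert size.
        minimum = insert_size // 2
    # ASSUMPTION: maximum is 1.5x the insert size, inferred from the
    # guide's single example (300 listed as 0 to 450).
    maximum = insert_size * 3 // 2
    return minimum, maximum
```

Under these assumptions, an insert size of 300 gives 0 to 450, which matches the guide's example.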

Name pattern - In order for NGen to identify Sanger pairs using a sequence naming convention, the convention must systematically distinguish between different pair reads while specifying which pair reads are associated. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone.

If applicable, select one of the following predefined file naming patterns from the Name Pattern dropdown list:

o sample_f.abi <-> sample_r.abi
o sample100.f_abc.abi <-> sample100.r_abc.abi
o sample_n100.abi <-> sample_f100.abi
o SAMPL0D1234.abi <-> SAMPL0E1234.abi

If none of the predefined patterns matches your file naming convention, you may select Custom Pair Specifier from the dropdown list, and then manually enter the appropriate expressions for Forward and Reverse naming conventions.

Note: Naming conventions should use a subset of regular expressions which utilize elements of the Grep language. For more information, see Example Regular Expressions.

Once you are finished, click OK to return to the Input Sequence Files dialog.

Set Pair Information (All Others)

In the Input Sequence Files dialog, the following version of the Set Pair Information dialog pops up automatically when you fulfill either of these criteria:

Normal templated workflow + any read technology + paired data
Any workflow except normal templated + non-Sanger read technology + paired data

Note: If you are doing any workflow other than normal templated and are using Sanger data, a different version of the dialog appears. See Set Pair Information (Certain Sanger Data) for details.


Insert size - This box may originally be blank or may contain a changeable default value, depending on the read technology you chose. Enter the anticipated distance between paired end reads across the library. SeqMan NGen will use this value to automatically calculate the minimum and maximum insert distances.

Discard reads without linkers - This option only appears in the dialog if you chose Templated assembly special workflows in the Choose Assembly Type screen, and specified a Read Technology of Ion Torrent in the Input Sequence Files screen. If you input an Insert size and leave the box checked, clicking OK will launch a further dialog. Choose between Standard Linker and Custom Linker. If you choose the latter, you must paste or type the junction linker in the box provided. Click OK to return to the Input Sequence Files dialog.

The following information should be considered when making choices in the Set Pair Information dialog:

During assembly, SeqMan NGen lists this value as a range. If, for example, you enter an Insert size of 300, the Assembly Log will list the value as 0 to 450. This convention does not impact assembly results.

For short inserts containing fewer than 1000 bases, SeqMan NGen sets the minimum size to 0 to catch smaller outliers, which tend to be common. For larger inserts, it sets the minimum to half of the Insert size, with the exception of Illumina data, for which the minimum is set to 0.

Long insert Illumina reads have a minimum of 0 because only half the reads consist of long inserts. The other half consist of short inserts (~300 bp), with the short inserts pointing towards one another, and the long inserts pointing away. The 0 value is used by SeqMan NGen's small genome assembler as a flag to account for the undetermined insert size.

If you specified Ion Torrent read technology in the Input Sequence Files dialog and enter a value of in the Set Pair Information dialog, SeqMan NGen assumes the library is paired end (small insert). For values 800, the library is assumed to be mate pair (long insert).

Using Paired End Data

Note: The following information does not apply to the normal templated workflow.

Paired end reads are typically in two files, with the forward reads in one file and the reverse reads in the other. SeqMan NGen assumes the pair will be from opposite ends of the same DNA fragment, and sequenced from the end of the fragment inwards.

To add paired reads, go to the Input Sequence Files dialog and Add your read files to the lower pane ("Set up paired reads").

To enable SeqMan NGen to identify pairs, a sequence naming convention must systematically distinguish between different pair reads while specifying which pair reads are associated. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone.

Expressions for these naming conventions are created using a subset of regular expressions, which utilize elements of the Grep language. The following rules apply:

Two parallel files must use a standard naming convention (e.g. s_7_1_sequence and s_7_2_sequence).
Forward and reverse reads must be in exactly the same order in the two files.
Both forward and reverse reads must be present for every pair, including pairs where one of the reads failed or is of very low quality.
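The ordering rules listed above can be checked before assembly with a short script. The Python sketch below is an assumption-laden illustration, not a SeqMan NGen feature. It assumes well-formed FASTQ input with four lines per record, and it compares read names using the name/1 and name/2 suffix convention that this guide shows for Illumina pairs, falling back to whole-name comparison when the suffixes are absent. The function names are hypothetical.

```python
import re
from itertools import zip_longest

# Suffix convention shown for Illumina pairs in this guide: name/1 and name/2.
FORWARD = re.compile(r"(.*)/1$")
REVERSE = re.compile(r"(.*)/2$")


def read_names(fastq_lines):
    """Return read names from FASTQ lines, assuming four lines per record."""
    lines = [line.strip() for line in fastq_lines if line.strip()]
    # Header lines are every fourth line; drop the leading "@" and any comment.
    return [header[1:].split()[0] for header in lines[0::4]]


def first_mismatch(forward_lines, reverse_lines):
    """Return the index of the first inconsistent record, or None if all agree."""
    names = zip_longest(read_names(forward_lines), read_names(reverse_lines))
    for index, (fwd, rev) in enumerate(names):
        if fwd is None or rev is None:
            return index  # one file ran out of reads before the other
        f_match, r_match = FORWARD.match(fwd), REVERSE.match(rev)
        if f_match and r_match:
            if f_match.group(1) != r_match.group(1):
                return index
        elif fwd != rev:
            return index  # no /1 and /2 suffixes: compare whole names
    return None
```

Both arguments can be open file handles or lists of lines. A return value of None means every forward read has its reverse partner at the same position in the second file.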

As an example, forward and reverse Sanger pair files are named as follows: 01f.abi and 01r.abi, where 01 indicates that they are members of the same pair. The f and r at the end of each sequence name distinguish the orientation. In Grep, the naming convention would be written as follows:

Forward convention: (.*)f\..*$
Reverse convention: (.*)r\..*$

Note: For more information on Grep name patterns, see Example Regular Expressions.

See the links below for read technology-specific information about using paired-end data.

Illumina Pairs

Paired end reads are typically in two files, or a small number of files if they are from multiple runs or lanes. These pairs are specified by a naming convention used in the *.fasta file comment line.

For SNG assemblies (called SMNG in Linux) with paired end reads, SeqMan NGen automatically adds the following information to the script:

setpairspecifier pairs: { { } } forward: (.*)/1 reverse: (.*)/2 min: 0 max: 750 key: Illumina

If reads do not match one of the pair specifiers, or if the forward and reverse specifiers are represented by empty strings (""), SNG will attempt to match using the whole name of the sequence. If exactly two reads have the same name, they will be considered a match.

For XNG assemblies, SeqMan NGen adds the following information:

{
is Pair: true
file: "****"

SeqTech: "Illumina"
mindist: 0
maxdist: 750
}

For XNG assemblies with paired-end reads, SeqMan NGen recognizes the pairs by their file names. The following examples demonstrate some of the filename formats that SeqMan NGen supports for XNG pairs. Within each pair, the filenames differ only in the region that specifies the forward and reverse reads:

"R_2011_11_21_11_06_08_user_C29-100_PE_DH10B_11_Auto_C29-100_PE_DH10B_11_4120_reverse_pe2.fastq",
"R_2011_11_21_11_06_08_user_C29-100_PE_DH10B_11_Auto_C29-100_PE_DH10B_11_4120_forward_pe1.fastq",
"Strain1234_L7_R1_ATCACG_Index1.fastq",
"Strain1234_L7_R2_ATCACG_Index1.fastq",
"K12-1-B_TGACCA_L006_R1.fastq",
"K12-1-B_TGACCA_L006_R2.fastq",
"GBBC920_GGCTAC_L008_R1.filt.50bp.fastq",
"GBBC920_GGCTAC_L008_R2.filt.50bp.fastq"
"tiny_1.txt",
"tiny_2.txt",
"tiny_1_sequence.txt",
"tiny_2_sequence.txt",
"tiny1._qseq",
"tiny2._qseq",
"s_1_1_sequence.txt"
"s_1_2_sequence.txt"
"C29-129_forward_pe1.fastq"

"C29-129_forward_pe2.fastq"

The Grep used to match the pair filenames is shown below:

"(?'name'.*?)_r1_(?'ext'.*)\\.fastq",
"(?'name'.*?)_r2_(?'ext'.*)\\.fastq",
"(?'name'.*?)_r1\\.(?'ext'.*)\\.fastq",
"(?'name'.*?)_r2\\.(?'ext'.*)\\.fastq",
"(?'name'.*?)_forward_pe1(?'ext_p'\\.fastq)",
"(?'name'.*?)_reverse_pe2(?'ext_p'\\.fastq)",
"(?'name'.*?)1\\.fastq",
"(?'name'.*?)2\\.fastq",
"(?'name'.*?)1_sequence\\.txt",
"(?'name'.*?)2_sequence\\.txt",
"(?'name'.*?)1\\.txt",
"(?'name'.*?)2\\.txt",
"(?'name'.*?)1\\._qseq",
"(?'name'.*?)2\\._qseq",

The following script command can be used to add support for a new filename format. The command must be executed before assembly. The pattern will be used for all subsequent assembletemplate commands for that run of XNG.

pairfilepattern forward: "(?'name'.*?)_r1_(?'ext'.*)\.fastq" reverse: "(?'name'.*?)_r2_(?'ext'.*)\.fastq"

Roche 454 Pairs

Paired end reads are provided as a single read containing the pair joined by a linker sequence.

When assembling 454 paired end reads, SeqMan NGen will check for the presence of a linker defining the paired end reads. Reads with an identifiable linker are split into forward and reverse reads, with the forward read flipped so the traditional orientation is maintained. These reads are

then put into parallel fastq files. SeqMan NGen appends each file name with _1 or _2, following the Illumina paired end convention. The read names themselves are appended with for or rev.

In cases where the linker occurs at the end of the read, the linker is removed and a single end read is placed in a file with _unpaired appended to the name. Reads where no linker is detected are also placed in the _unpaired file.

454 paired end splitting can also be specified through scripting.

Sanger Pairs

Paired end reads are typically in multiple files, with the forward reads having an f or forward in the name and the reverse reads having r or reverse in the name.

Example Regular Expressions

Examples of expressions you may find useful regarding paired end naming specifications follow. Please note this is not a complete list of regular expressions, and the definitions of the terms used are limited to their application to SeqMan NGen paired end naming specifications.

Special Characters

[ ]     Character class: used to enclose a list of alternatives
\       A switch that makes special characters literal and literal characters special
( )     Grouping: used to delimit a string comprising a phrase. Phrases are necessary in paired end specification so you can match a pair of forward and reverse reads while still distinguishing their orientation. In SeqMan NGen, phrases in parentheses must match for two reads to qualify as a pair; phrases outside the parentheses are used to distinguish members of the same pair.
\d      Any digit (0-9)
\D      Any non-digit character
\w      Any alphanumeric word character (including _)
.       Any character
|       Alternate: either the term before or after
^       Match at the beginning of the line only
$       Match at the end of the line only

Numerical Modifiers

*       0 or more
+       1 or more
?       1 or 0
{n}     Exactly n
{n,}    At least n
{n,m}   At least n but not more than m
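The grouping rule described above (the phrase in parentheses must match for two reads to form a pair, while text outside the parentheses distinguishes forward from reverse) can be illustrated with a short Python sketch. It uses the guide's own Sanger example patterns, (.*)f\..*$ and (.*)r\..*$. The function name is hypothetical, and using Python's re module as a stand-in for the Grep subset that SeqMan NGen accepts is an assumption that holds only for the simple constructs shown here.

```python
import re


def pair_reads(names, forward_pattern, reverse_pattern):
    """Group read names into (forward, reverse) pairs by captured phrase.

    Returns (pairs, unpaired). Two names pair up when the text captured
    by the parenthesised groups is identical in both.
    """
    forward_re = re.compile(forward_pattern)
    reverse_re = re.compile(reverse_pattern)
    forward, reverse = {}, {}
    for name in names:
        match = forward_re.match(name)
        if match:
            forward[match.groups()] = name
            continue
        match = reverse_re.match(name)
        if match:
            reverse[match.groups()] = name
    pairs = [(forward[key], reverse[key]) for key in forward if key in reverse]
    paired = {name for pair in pairs for name in pair}
    unpaired = [name for name in names if name not in paired]
    return pairs, unpaired


# Forward and reverse conventions from the guide's Sanger example.
FORWARD = r"(.*)f\..*$"
REVERSE = r"(.*)r\..*$"
```

Calling pair_reads(["01f.abi", "01r.abi", "02f.abi"], FORWARD, REVERSE) pairs the two 01 files and reports 02f.abi as unpaired. Note that the named-group syntax (?'name'...) shown for XNG filename patterns is not accepted by Python's re module, which uses (?P<name>...) instead.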

Example Expressions and Their Meanings

d                  Literally the letter d
\d                 Any digit (0-9)
\d*                Zero or more digits
\d+                One or more digits
(\d+)              A phrase comprising one or more digits; same as \d+, but causes SeqMan NGen to match the names from the string inside the phrase when other characters in the name may not match.
\.                 Literally the period symbol (.)
.                  Any character
.+                 One or more of any characters
.*                 Zero or more of any characters
a|b                a OR b
ab[i1]             abi or ab1
abi$               Ends with abi
[\.\d]             A period OR a digit
[abc]              a OR b OR c
[abc]+             One or more characters from the set a, b, c
.*f                Any number of any characters followed by the letter f
(.*)f              A phrase comprising any number of any characters, followed by the letter f; same as .*f, but causes SeqMan NGen to match the phrase in parentheses without matching the f in a read name
(\D+)r(\d+)        One or more non-digit characters followed by r followed by one or more digits
(\d{2,4})f(\.abi)  Two, three or four digits followed by f followed by .abi

Set Up Experiments

If you chose Templated assemblies with control from the Choose Assembly Type screen, you were prompted to enter experiment names in the Input Sequence Files dialog. The screen that follows in this situation is Set Up Experiments, which allows you to enter further information about controls and to specify a VCF file.

The upper part of this wizard screen contains four columns:

Column: Experiment
Description: This column is pre-loaded with the experiment names specified in the previous screen. If you wish to edit an experiment name, you must go back to that screen (Input Sequence Files) using the < Back button and edit it there.

Column: Is Control
Description: Use the checkboxes to indicate which experiments are controls. When a box is checked, the adjacent Control Type cell changes from None to <SELECT>.

Column: Control Type
Description: By default, each Control Type is listed as None, signifying that the experiment is not a control. If you check a box in the Is Control column, the Control Type changes from None to <SELECT>. Click on any cell in the Control Type column to activate a dropdown menu from which you can select one of three options:

None - Not a control.

Baseline - A normal tissue control used in combination with diseased (e.g., tumor) tissue from the same individual. If you select Baseline, and the Is Control box was not previously checked, it will be checked automatically.

Validation - A control sample for assessing the quality and accuracy of the capture, sequencing, assembly and variant calling. For example, highly curated variant files of the HapMap NA12878 genome are available from the Genome in a Bottle consortium, with corresponding reference materials available from the Coriell Institute. Validation of the completed assembly is performed within DNASTAR's ArrayStar application. See the ArrayStar online help for details. If you select Validation, and the Is Control box was not previously checked, it will be checked automatically.

Column: VCF File
Description: If you specify a Validation control, you must specify a VCF file for that experiment before you can proceed to the next wizard screen. Otherwise, an error message will appear: A VCF file must be specified for the experiment marked as a validation control. Select any cell in this column and then click the Set VCF File button to specify a VCF file. See The Set VCF File button section just below this table.
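The rule in the table above, that an experiment marked as a Validation control needs a VCF file before the wizard will continue, can be expressed as a small check. The Python sketch below is illustrative only. The dictionary keys (name, control_type, vcf_file) and the function name are hypothetical and do not correspond to any SeqMan NGen file format or API.

```python
CONTROL_TYPES = ("None", "Baseline", "Validation")


def check_experiments(experiments):
    """Return a list of error messages for an experiment table.

    Each experiment is a dict with hypothetical keys:
    'name', 'control_type' (one of CONTROL_TYPES) and 'vcf_file'.
    """
    errors = []
    for experiment in experiments:
        name = experiment["name"]
        control_type = experiment.get("control_type", "None")
        if control_type not in CONTROL_TYPES:
            errors.append(f"{name}: unknown control type {control_type!r}")
        elif control_type == "Validation" and not experiment.get("vcf_file"):
            # Message text quoted from the guide.
            errors.append(
                f"{name}: A VCF file must be specified for the "
                "experiment marked as a validation control."
            )
    return errors
```

An empty list means the table satisfies the rule. Baseline controls and non-control experiments pass with or without a VCF file.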

The Set VCF File button:

This button is used to launch a file browser from which you may select a file in VCF format. To associate a VCF file with a particular experiment or control, select the empty cell in the VCF File column before using the button. However, the button may also be used without first selecting a cell.

After you select the VCF file, the Associate VCF File dialog opens, offering three options:

VCF file for validation control experiment
VCF file for non-validation experiments
VCF file for all experiments

The option selected in this dialog overrides the cell (if any) that was selected in the VCF File column. For example, if you select a VCF File cell in the table corresponding to a Validation control sample, but then choose VCF file for non-validation experiments, the non-validation experiment cells in the table will be populated with the selected VCF file.

The Clear VCF File button:

To remove a VCF file from a cell in the VCF File column, select the cell and then click Clear VCF File.

Input BAM Layout File

Click the Browse button in this dialog to choose the BAM reference sequence (*.bam).

Once you are finished, click Next > to continue to the next wizard screen.

Read Options

The Read Options dialog displays the parameters used for running pre-assembly scans and allows you to adjust their values. This dialog is only available for de novo and special templated workflows.

Maximum Total Reads - By default, this box is checked and the number 10,000,000 is entered. When using Illumina technology, we recommend leaving the box checked and specifying a value to limit the number of reads used in the assembly. For 454 technologies, we suggest unchecking the box and leaving the field blank.

Note: If you check Maximum Total Reads, be sure to add individual read files rather than folders in the Input Sequence Files dialog. Adding files individually causes SeqMan NGen to use an equal number of reads from each file. If you instead add a folder, SeqMan NGen may use reads from only the first file(s).

Specify whether you would like SeqMan NGen to perform any of the following pre-assembly tasks:

Quality end trim - To automatically trim reads prior to assembly based on quality scores and specified quality end trimming parameters.

Vector/adapter scan - To use specified vector/adapter scan parameters to scan and trim reads for the vector or adapter.


Contaminant scan - To use specified contaminant scan parameters to scan and remove reads that contain contaminant sequences. This option is not available in the Metagenomics/16S rRNA workflow.

Repeat scan - To use specified repeat scan parameters to scan reads for known repetitive sequences. All sequences identified as repeats will be added to the assembly last, after all non-repeats have been assembled.

Click an Add button to the right of these last three options to select the desired vector, contaminant, or repetitive sequence(s) from the corresponding Files and Folders dialog.

To edit options for any of the above tasks, or to set up parameters for fixed end trimming, click the Advanced Trim/Scan Options button to open the Advanced Trim/Scan Options dialog. The option for fixed end trimming is only accessible by clicking this button.

Once you are finished, click Next > to continue to the next wizard screen.

Files and Folders Dialogs

The Read Options dialog allows you to access Vector, Contaminant, and Repeat Files and Folders dialogs via the three associated Add buttons. These nearly identical Files and Folders dialogs are used to add files for the functions of vector trimming, contaminant scanning, and removal of known repeats.

Select File - Click to navigate to and select individual sequence(s). See the SeqMan NGen Supported File Types page for supported file formats.

Select Folder - Click to navigate to and select an entire folder of sequences.

Clone site (Vector dialog only) - Enter the position of the cloning site where insertion occurs.

Remove - Click to remove a selected (highlighted) file from the list.

Once you are finished, click OK to save your changes and return to the Read Options dialog, or Cancel to discard changes before returning.

Advanced Trim/Scan Options

From the Read Options dialog, clicking the Advanced Trim/Scan Options button brings you to the Advanced Trim/Scan Options dialog. This dialog allows you to view and modify trimming parameters and vector, repeat and contaminant scanning parameters.


Quality End Trimming Settings:

Minimum quality - The minimum averaged quality score of the evaluated window that is required in order to be considered low-quality.

Window - The length of the window to be used for averaging quality scores.

Fixed End Trimming Settings:

Do fixed end trimming - Check this box to implement pre-assembly fixed end data trimming. Enter the number of base pairs you wish to trim in the 5' trim and 3' trim fields.

Note: The values entered for 5' trim and 3' trim are used differently, depending on whether 3' value is measured from 5' end is selected. If it is not selected, the 5' trim and 3' trim values indicate the number of bases for SeqMan NGen to trim from the respective ends of each read. If it is selected, the 5' trim and 3' trim values indicate the specific coordinates to which reads should be trimmed.
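The two interpretations of the 5' trim and 3' trim values described in the note above can be illustrated with a minimal Python sketch. The function name `fixed_end_trim` and its parameters are hypothetical, and the coordinate convention used when the 3' value is measured from the 5' end (keep the bases between the two coordinates, counted from the start of the read) is an assumption, not a documented detail of SeqMan NGen.

```python
def fixed_end_trim(read: str, trim5: int, trim3: int,
                   three_from_five_end: bool = False) -> str:
    """Return the portion of `read` kept after fixed end trimming (illustrative only)."""
    if trim5 < 0 or trim3 < 0:
        raise ValueError("trim values must be non-negative")

    if three_from_five_end:
        # Assumed coordinate mode: both values are positions counted from the
        # 5' end, so the read is kept from position trim5 up to position trim3.
        return read[trim5:trim3]

    # Count mode: remove trim5 bases from the 5' end and trim3 bases from the 3' end.
    end = len(read) - trim3
    return read[trim5:end] if end > trim5 else ""


if __name__ == "__main__":
    read = "ACGTACGTAC"
    print(fixed_end_trim(read, 2, 3))        # count mode
    print(fixed_end_trim(read, 2, 8, True))  # coordinate mode
```

In count mode the result depends on the read length, while in the assumed coordinate mode every read is cut to the same fixed span, which is the practical difference between the two settings.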

Other End Trimming Options:

Trim to mer - Check this box to trim the reads to the matching mer within the read. For each read, SeqMan NGen looks for mers that exist in the template (for templated assemblies) or in any other read in the assembly (for de novo assemblies). It then sets the trimming for the read to the start of the first mer found and the end of the last mer found. Trimming to mer may be useful when assembling data without accurate quality scores or data with very short linkers.

Vector/Adapter Scan Settings:

Mer length - The minimum length of a mer required to be considered an exact match when searching for vector.

Minimum matches - The minimum number of matching mers required to start an alignment.

Trim length - The minimum length required for a mer to be considered as a match for vector trimming.

Trim to end - The distance to the endpoint where trimming will go all the way to the end of the sequence.

Repeat Scan Settings:

Mer length - The minimum length of a mer required to be considered an exact match when scanning for repeats.

Minimum matches - The minimum number of matching mers required to be considered a repeat.

Flag length - The minimum length required for a mer to be flagged as a repeat.

Contaminant Scan Settings:

Mer length - The minimum length of a mer required to be considered an exact match when scanning for contaminants.

Minimum matches - The minimum number of matching mers required to mark the sequence as a contaminant.

Once you are finished, click OK to save your changes and return to the Read Options dialog, or Cancel to discard changes before returning.
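The way the Mer length and Minimum matches settings interact in a contaminant scan can be sketched as follows. This is only an illustration of the idea of counting exact mer matches against a threshold; the function names are hypothetical, only the forward strand is checked, and SeqMan NGen's actual scanning algorithm is not described in this guide.

```python
def mers(seq: str, k: int) -> set:
    """All distinct subsequences (mers) of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_matching_mers(read: str, reference_mers: set, k: int) -> int:
    """Number of positions in the read whose mer also occurs in the reference set."""
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in reference_mers)


def is_contaminant(read: str, contaminant: str,
                   mer_length: int, minimum_matches: int) -> bool:
    """Flag a read when it shares at least `minimum_matches` exact mers
    of length `mer_length` with the contaminant sequence (assumed rule)."""
    reference = mers(contaminant.upper(), mer_length)
    return count_matching_mers(read.upper(), reference, mer_length) >= minimum_matches


if __name__ == "__main__":
    contaminant = "ACGTACGTTTGACCA"
    read = "GGGACGTACGTTTCCC"
    print(is_contaminant(read, contaminant, mer_length=8, minimum_matches=2))
```

Raising the mer length makes each match more specific, while raising the minimum number of matches requires more shared sequence before a read is flagged.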

Assembly Options

The Assembly Options dialog allows you to specify the parameters to use for your assembly. There are several versions of this dialog depending upon your choices in previous wizard screens. Follow the links below to go to the appropriate help topic for your workflow.

Assembly Options (BAM Layout)

The Assembly Options dialog allows you to specify the parameters to use for your assembly. If you are following the BAM layout workflow, the following version of the dialog appears.

Select the type of Genome ploidy for your project. Choosing Haploid or Diploid establishes the statistical model SeqMan NGen will use in estimating probabilities during SNP calls. Selecting Population / other (e.g. for a polyploid genome) causes SeqMan NGen not to calculate probabilities.

If desired, click the Advanced Assembly Options button to open the Advanced Assembly Options dialog. This dialog allows you to view and edit additional assembly parameters.

Once you are finished, click Next > to continue to the next wizard screen.

Assembly Options (De Novo, Special Templated)

The Assembly Options dialog allows you to specify the parameters to use for your assembly. If you are following the de novo or special templated workflows, you will see the following version of the dialog.

Repeat Handling - Checking this box automatically computes a threshold for determining the number of identical subsequences of bases, or mers, used to indicate a putative repeat. (For more information, see the Repeat Handling section.)

Note: The Repeat Handling section is not included in this dialog if you are performing a special templated assembly or a transcriptome assembly.

o Expected genome length - If you know the approximate length of the genome/fragment being assembled, select this button and specify a length. SeqMan NGen will then calculate the expected average coverage empirically from the amount of data. This, in turn, allows repeat regions to be identified and handled more accurately, resulting in a better assembly. If the approximate genome length is not known, use the Expected coverage option.

o Expected coverage - If you do not know the length of the genome/fragment, select this button and provide an estimate of the depth of the sequencing. The default value for this field is 20, and the maximum allowable value is 65,535. If you enter a value larger than the maximum, you may receive an error message and be prevented from continuing until you choose a value less than or equal to the maximum.

Note: Use caution when estimating the value for Expected coverage. If the value you use is significantly lower than the actual depth, the assembly may take a much longer time to complete and may have too many mers flagged as repeats. We recommend using Expected genome length whenever possible.

Mer size - The minimum length of a mer (overlapping region of a fragment read), in bases, required to be considered a match when arranging reads into contigs. Mer size information is used to identify matches during the assembly layout phase. The default mer size is determined by the selected read technology and is shown in the window. For more information, see the Mer Tags section.

o Automatic - Select this button to automatically set the size based on assembly type and sequencing technology.

o Custom - Select this button to choose the size yourself. You must enter the desired number of base pairs in the field at right. Lowering the mer size increases the sensitivity of finding matches, but also increases the likelihood of finding spurious matches in addition to the correct match. Lowering the mer size can also greatly increase the storage requirements for intermediate and temporary files in large projects.

Minimum match percentage - Specifies the minimum percentage of matches in an overlap that are required to include a sequence read in the final alignment. For more information, see the Match Percentage section.

o Automatic - Select this button to automatically set the percentage based on assembly type and sequencing technology.

o Custom - Select this button to designate the percentage yourself. You must enter a number in the field at right.

Realign reads after assembly - Check this box to include a realignment step after the assembly. This step analyzes each sequence at the nucleotide level to determine the exact position of each sequence in the alignment and realigns contigs as needed. For templated assemblies, this option may improve the accuracy of the final assembly by correcting occasional misalignments that can occur in gapped regions. However, this step may significantly increase the time needed to assemble.


Remove small contigs after assembly - Check this box and type values in one or both boxes:

o Minimum sequences - To disassemble any untemplated contigs with fewer than the specified number of sequences.

o Minimum length - To disassemble any untemplated contigs shorter than the specified length.

Note: Both options affect only untemplated contigs. No templated contigs will be removed.

Genome ploidy - Select the type of ploidy for your project. Choosing Haploid or Diploid establishes the statistical model SeqMan NGen will use in estimating probabilities during SNP calls. Selecting Population / other (e.g. for a polyploid genome) causes SeqMan NGen not to calculate probabilities.

Note: The Genome ploidy option is only displayed if BAM Format is checked in the Save project as section of the Set Up Project Files dialog.

If desired, click the Advanced Assembly Options button to open the Advanced Assembly Options dialog. This dialog allows you to view and edit additional assembly parameters. Or click the SNP Options button to open the SNP Options dialog, where you can change SNP-related parameters.

Note: The SNP Options button is only displayed if you checked BAM Format in the Save project as section of the Set Up Project Files dialog.

Once you are finished, click Next > to continue to the next wizard screen.

Note that if you check Repeat handling without specifying an Expected genome length, you will receive the following error message after clicking Next. Click OK and adjust the dialog parameters before again clicking Next.
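The relationship between Expected genome length and Expected coverage in the Repeat Handling settings can be illustrated with the usual estimate of average coverage as total sequenced bases divided by genome length. The sketch below assumes that estimate; the guide only states that coverage is calculated "empirically from the amount of data", so the exact formula used by SeqMan NGen is an assumption here, as are the function names. The 65,535 limit is the maximum stated in this guide.

```python
MAX_EXPECTED_COVERAGE = 65_535  # maximum allowable value stated in this guide


def expected_coverage(read_lengths, genome_length: int) -> float:
    """Assumed estimate: total sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return sum(read_lengths) / genome_length


def validate_expected_coverage(value: float) -> float:
    """Reject values above the documented maximum, mirroring the wizard's error."""
    if value > MAX_EXPECTED_COVERAGE:
        raise ValueError(f"Expected coverage must be <= {MAX_EXPECTED_COVERAGE}")
    return value


if __name__ == "__main__":
    # One million 100-base reads over a 5 Mb genome (hypothetical numbers).
    reads = [100] * 1_000_000
    print(validate_expected_coverage(expected_coverage(reads, 5_000_000)))
```

With these hypothetical numbers the estimate equals the dialog's default Expected coverage of 20, which shows why supplying a genome length lets the depth be derived from the data rather than guessed.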

Assembly Options (All Others)

The Assembly Options dialog allows you to specify the parameters to use for your assembly. If you are following the normal templated, reference-guided or viral-host workflows, the following version of the dialog appears.

Mer size - The minimum length of a mer (overlapping region of a fragment read), in bases, required to be considered a match when arranging reads into contigs. Mer size information is used to identify matches during the assembly layout phase. The default mer size is determined by the selected read technology and is shown in the window. For more information, see the Mer Tags section.

o Automatic - Select this button to automatically set the size based on assembly type and sequencing technology.

o Custom - Select this button to choose the size yourself. You must enter the desired number of base pairs in the field at right. Lowering the mer size increases the sensitivity of finding matches, but also increases the likelihood of finding spurious matches in addition to the correct match. Lowering the mer size can also greatly increase the requirements for storing intermediate and temporary files with large projects.

Minimum match percentage - Specifies the minimum percentage of matches in an overlap that are required to join two sequences in the same contig. (For more information, see the Match Percentage section.)

o Automatic - Select this button to automatically set the percentage based on assembly type and sequencing technology.

o Custom - Select this button to designate the percentage yourself. You must enter a number in the field at right.

Genome ploidy - Select the type of ploidy for your project. Choosing Haploid or Diploid establishes the statistical model SeqMan NGen will use in estimating probabilities during SNP calls. Selecting Population / other (e.g. for a polyploid genome) causes SeqMan NGen not to calculate probabilities.

Note: The Genome ploidy option is only displayed if you checked BAM Format in the Save project as section of the Set Up Project Files dialog.

SNP filter stringency (not available in all workflows) - The three radio buttons specify High, Medium or Low stringency levels for soft filtering of SNPs. This means that SNPs of the least interest to you will be automatically hidden when SNP reports/tables are viewed in SeqMan Pro or ArrayStar. However, these filtered SNPs are not removed from the assembly, and can be made visible again by changing the SNP filtering parameters in either SeqMan Pro or ArrayStar.

Note: Hard filtering of SNPs can be done through the SNP tab of the Advanced Options dialog.
[Table: SNP Filter Parameter (Depth, P not ref, Min SNP%) by SNP Filter Stringency (High, Medium (default), Low)]

If desired, click the Advanced Assembly Options button to open the Advanced Assembly Options dialog. This dialog allows you to view and edit additional assembly parameters.

Once you are finished, click Next > to continue to the next wizard screen.

Advanced Assembly Options

Clicking the Advanced Assembly Options button from the Assembly Options or SNP Options dialog opens an Advanced (Assembly) Options dialog. The dialog may be plain or tabbed, depending upon your choices in previous wizard screens. Follow the links below to go to the appropriate help topic for your workflow.

Advanced Options (Normal Templated, Reference-Guided)

In normal templated and reference-guided workflows, clicking the Advanced Assembly Options button from the Assembly Options dialog opens a tabbed Advanced (Assembly) Options dialog. There are several versions of this dialog depending upon your choices in previous wizard screens.

Layout Options

Clicking the Advanced Assembly Options button from certain Assembly Options dialogs opens a tabbed Advanced Options dialog. The Layout tab allows you to view and edit Layout Options. Default parameters vary according to the sequencing technology and project type specified elsewhere in the wizard, and values seldom need to be changed.


Repeat read placement - (currently disabled).

Maximum repeat count - Enter the maximum number of occurrences for any given mer in the reference sequence for it to be used in matching. Mers exceeding this value are flagged as repeats and are not used as mer tags in determining overlaps.

Maximum total reads - Check the box and enter a value if you wish to limit the read depth. Utilizing this option can make the assembly proceed faster.

Assembly output format - Use the drop-down menu to choose a format for the assembly output.

o BAM assembly package - To save the output as a *.bam file.

o SeqMan Pro document (.sqd) - To save the output in both *.bam and *.sqd formats.

o Unassembled seqs only - This option can be used as a filter to remove reads from an unwanted source in a mixed sample (e.g. removing host DNA from a viral sample).

Note: If you are doing a reference-guided assembly with gap closure, the only option enabled is *.sqd.

Limit deep regions - If this box is checked, areas of the assembly where an extreme number of reads (> 10,000) are laid out to the same area of the template will be filtered before alignment. This is not an exact filter and the maximum depth will typically be between 10,000 and 20,000 reads. This filter can improve performance, sometimes significantly.

Alignment Options

Clicking the Advanced Assembly Options button from certain Assembly Options dialogs opens a tabbed Advanced Options dialog. The Alignment tab allows you to view and edit Alignment Options for the gapped alignment phase. Default parameters vary according to the sequencing technology and project type specified elsewhere in the wizard, and values seldom need to be changed.


Enter values for:

Minimum aligned length - The minimum length of at least one aligned segment of a read after trimming. The default value varies depending on the read technology you selected.

Maximum gap size - The maximum number of gaps allowed per 1000 bases in the alignment.

Minimum match percentage - The minimum percentage of matches in an overlap required to join two sequences in the same contig. SeqMan NGen determines the percentage to use based on the sequencing technology you specified in the Assembly Options dialog.

Match score - The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing this value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together.

Mismatch penalty - The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage.

Gap penalty - The penalty for opening a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping.

Gap extension penalty - The penalty to the alignment score for extending a new or existing gap by one base. This is in contrast to the gap penalty, which is the penalty to the alignment score for opening up a new gap.

Alignment cutoff - Determines if the accumulation of gap openings, gap extensions and mismatches causes the alignment score to drop below the maximum alignment score. If so, the alignment will stop and will be trimmed to the point where the alignment score was at its maximum.

Favor 5' gap - If this box is checked, insertions and deletions in homopolymeric runs or simple sequence repeats will preferentially occur on the 5' end (top strand) of the run/repeat. The box is checked by default.

Auto trim reads - If this box is checked, the ends of reads are trimmed to best match alignment to the template. SeqMan NGen will mark the portion of the read that aligns well to the template, and will set the trimming to skip any of the poorly aligning parts of the read. Checking this option optimizes the end trimming of reads to maintain as much of the read as possible, while still meeting the minimum match percentage threshold. However, checking the box can also lead to the removal of true variant bases located near the ends of reads. The box is checked by default.

Trim to targeted regions - If this box is checked, reads extending beyond the 5' or 3' end of a targeted region will be trimmed to the target boundary. The box is unchecked by default.
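To make the interplay of Match score, Mismatch penalty, Gap penalty, Gap extension penalty and Alignment cutoff more concrete, here is a minimal sketch of a running alignment score that stops once it has fallen too far below its maximum and trims back to that maximum. Treating the cutoff as an allowed drop below the best score is an assumption, and the column codes, function name and numeric values are hypothetical; SeqMan NGen's actual scoring is not specified in this guide.

```python
def trim_at_cutoff(columns: str, match_score: int, mismatch_penalty: int,
                   gap_penalty: int, gap_extension_penalty: int, cutoff: int):
    """Walk an alignment column by column and return (kept_length, best_score).

    columns uses hypothetical codes: 'M' match, 'X' mismatch,
    'O' gap opening, 'E' gap extension.
    """
    deltas = {
        "M": match_score,
        "X": -mismatch_penalty,
        "O": -gap_penalty,
        "E": -gap_extension_penalty,
    }
    score = best = 0
    best_len = 0
    for i, col in enumerate(columns, start=1):
        score += deltas[col]
        if score > best:
            best, best_len = score, i      # new maximum: remember where it occurred
        elif best - score > cutoff:
            break                          # dropped too far below the maximum: stop
    return best_len, best                  # alignment is trimmed back to the maximum


if __name__ == "__main__":
    print(trim_at_cutoff("MMMMMXXXXMM", 10, 20, 50, 10, cutoff=60))
    print(trim_at_cutoff("MMMMMXMMM", 10, 20, 50, 10, cutoff=60))
```

In the first call the run of mismatches pushes the score past the cutoff, so only the leading matches are kept; in the second, a single mismatch is absorbed and the whole alignment survives.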
SNP Options

Launch the SNP Options dialog by clicking the Advanced Assembly Options button from certain Assembly Options dialogs (you may then need to click the SNP tab) or the SNP Options button from the Recalculate SNPs dialog.

The SNP Options dialog allows you to view and edit options related to SNP calculation. The options chosen in this dialog affect the hard filtering of SNPs. This means that SNPs of the least interest to you will be automatically and permanently removed from the assembly.

Note: For information on reversible soft filtering of SNPs, see the description of SNP filter stringency in the topic Assembly Options (All Others).


Default parameters vary according to the sequencing technology and project type specified elsewhere in the wizard, and values seldom need to be changed.

Calculate SNPs - Check to turn on in-line SNP detection. This box is checked by default.

SNP calculation method - Use the drop-down menu to select the desired calculation method.

o Simple percentage - To detect only whether the column is most likely to have the same base or a different base from the reference. Reference bases are not reported. The most frequent base in the column which is not the reference is treated as the potential SNP call. The "SNP%" output value is the percentage of the column which corresponds to the base chosen as the alternative to the reference. The direction and quality (weight) of the bases in the column are not considered. You may choose a minimum threshold for this value.

o Diploid bayesian - To use a Bayesian statistical model very similar to MAQ (Li et al., 2008, Genome Res. 18:1851) to call SNPs between three potential genotypes: homozygous reference, homozygous variant (some other base), and heterozygous (two bases, which may include the reference). This menu choice is optimized for diploid genomes.

Before applying this model, the simple SNP caller is run to more quickly establish a percentage with which the column is screened. If the column passes a minimum percentage screen, it is then checked against a minimum variant depth: the most frequent variant base must meet or exceed this threshold. Putative SNP-containing columns are then evaluated with a statistical model that considers the two most frequent bases in the column as possible alleles. If there is only one base, the reference is used as the other base, regardless of its depth in the column.

The model then calculates the P not ref of each set of bases, meaning the probability that they occurred by random chance. This is based on the base frequency, combined frequency of the two bases, the quality scores (weights), and the directions of the reads. A putative SNP base must have at least one read on each strand. The heterozygous call's probability is based on simple permutations and a constant modifier, with the strands considered separately. Since they are the only possible genotypes, probabilities are normalized against one another, and the highest probability is called.

o Haploid bayesian (default) - A Bayesian method similar to the one above, but optimized for haploid genomes.

Enter values for:

Minimum SNP percentage - The minimum percent of non-reference bases required to call an SNP. When it performs SNP passes, SeqMan NGen will include regions in an assembly that have coverage less than or equal to the specified value. The default value is 5. A nonzero value is recommended when using Ion Torrent data, or working with larger genomes or doing population studies. Very low values will lead to larger files, but do not necessarily result in better SNP calls.

P not ref - The minimum SNP quality score (Q call) required to include a position as a putative SNP.

Note: If you chose Cancer / somatic gene panel assembly in the Choose Project Type screen, P not ref is disabled. That workflow uses a simple percentage SNP caller and the P not ref statistic is not calculated.

Minimum SNP Count - The minimum number of non-reference bases required to call an SNP. When it performs SNP passes, SeqMan NGen will include regions in an assembly that have coverage less than or equal to the specified value.

Minimum base quality score - The minimum quality score below which a base will not be considered.

Check strands - Check this box to consider the strandedness of each read during SNP calculation. By default, the box is unchecked.
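The Simple percentage method and the Minimum SNP percentage and Minimum SNP Count thresholds described above can be illustrated with a short sketch. It follows the guide's description (the most frequent non-reference base in a column is the candidate, and SNP% is its share of the column, ignoring direction and quality), but the function name is hypothetical, the default count of 1 is an assumption, and this is not SeqMan NGen's implementation.

```python
from collections import Counter


def simple_percentage_call(column: str, reference_base: str,
                           min_snp_percentage: float = 5.0,
                           min_snp_count: int = 1):
    """Return (alternative_base, snp_percent) for one alignment column, or None.

    column holds the bases of all reads covering one reference position.
    """
    counts = Counter(base.upper() for base in column)
    ref = reference_base.upper()
    alternatives = {b: n for b, n in counts.items() if b != ref}
    if not alternatives:
        return None  # every base matches the reference

    # Most frequent base that is not the reference is the potential SNP call.
    base, n = max(alternatives.items(), key=lambda item: item[1])
    snp_percent = 100.0 * n / len(column)

    # Both thresholds must be met for the position to be reported.
    if snp_percent < min_snp_percentage or n < min_snp_count:
        return None
    return base, snp_percent


if __name__ == "__main__":
    print(simple_percentage_call("AAAAAAGGGA", "A"))
    print(simple_percentage_call("A" * 99 + "T", "A"))
```

Using the percentage and count thresholds together filters out both low-frequency noise in deep columns and single stray bases in shallow ones.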


Note: Minimum SNP percentage and Minimum SNP Count can be used in tandem to control the number of reportable SNPs, and by extension, the size of the SNP table.

Once you are finished, click OK to save changes and return to the previous dialog (either Assembly Options or SNP Options), or Cancel to return without saving changes.

Advanced Assembly Options (De Novo)

Clicking the Advanced Assembly Options button from the Assembly Options dialog in a de novo workflow opens the Advanced Assembly Options dialog. Default parameters vary according to the sequencing technology and project type specified elsewhere in the wizard, and values seldom need to be changed.

Enter values for:

Match score - The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing this value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together.

Mismatch penalty - The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage.

Gap penalty - The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping.

Max gap - The maximum number of gaps allowed per 1000 bases in the alignment.

SNP passes - The number of times SeqMan NGen will cycle through a templated assembly, attempting to fill in regions with zero or low coverage due to SNPs.

SNP match percent - The minimum match percentage required during passes to fill in SNP regions. The default value will change depending on the type of assembly and the read technology you selected.

SNP low cover cutoff - The minimum coverage required in an assembly to be excluded from SNP passes. SeqMan NGen will include regions in an assembly that have coverage less than the value specified as well as regions with zero coverage when it performs SNP passes. (See the SNP passes parameter above.)

Match window - The size of the window used to calculate match percentage.

Maximum coverage - The maximum depth of coverage allowed in a templated assembly. SeqMan NGen will not exceed the coverage specified by this threshold. The default value of 0 equals unlimited coverage.

Note: This parameter is only available for templated assemblies, and should be used with caution, as it will limit the number of sequences included in the assembly.

Match repeat percent - The percent frequency a mer occurs compared to its expected frequency. Mers exceeding this value are flagged as repeated and not used as mer tags in determining overlaps.

Match spacing - The length of the window of a sequence read where at least one mer tag will be chosen. The default value will change depending on the read technology you selected.

Default quality - The value used for the base quality of sequences without quality scores.

Default template quality - The value used for the base quality of template sequences without quality scores.

Max usable - Any mers occurring more frequently than the Repeat Handling expected coverage value multiplied by this value are disregarded as mer tags from the assembly.
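The Match window parameter controls a local, window-by-window match percentage check, in contrast to a single global percentage over the whole overlap. The sketch below contrasts the two, using the figures from this guide's Match Percentage topic (a 600-base overlap with 484 matches and an 80% threshold). The function names are hypothetical, and the window stride (`step`, defaulting to one base) is an assumption, since the guide does not state how far consecutive windows are shifted.

```python
def global_match_percentage(matches) -> float:
    """Percentage of matching columns over the entire overlap."""
    return 100.0 * sum(matches) / len(matches)


def windowed_match_ok(matches, min_match_percentage: float,
                      match_window: int = 50, step: int = 1) -> bool:
    """True only if every window of `match_window` columns meets the threshold.

    matches is a sequence of booleans, one per aligned column.
    """
    n = len(matches)
    if n < match_window:
        # Overlap shorter than one window: fall back to the overall percentage.
        return n > 0 and global_match_percentage(matches) >= min_match_percentage
    for start in range(0, n - match_window + 1, step):
        window = matches[start:start + match_window]
        if 100.0 * sum(window) / match_window < min_match_percentage:
            return False  # one failing window rejects the overlap
    return True


if __name__ == "__main__":
    # 400 perfectly matching bases followed by 200 bases matching at 42%.
    overlap = [True] * 400 + [(i % 50) < 21 for i in range(200)]
    print(round(global_match_percentage(overlap)))   # passes an 80% global test
    print(windowed_match_ok(overlap, 80.0))          # fails the local test
```

The example shows why a local check is stricter: the long identical repeat keeps the global percentage above the threshold, while the windows that fall in the unique regions do not.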
Once you are finished, click OK to save changes and return to the Assembly Options dialog, or Cancel to return without saving changes.

Match Percentage

By default, SeqMan NGen uses a local match percentage which requires that the match percentage threshold be met in each overlapping window of 50 bases. The size of this window can be adjusted by specifying a different value for the match window parameter. An example containing a repeated region follows.

A genome fragment has repeated regions labeled A and A', and two unique regions labeled B and C. When the fragment is sequenced, one of the sequences contains parts of regions A and B, and another contains parts of regions A' and C:

In this example, a minimum match percentage of 80% is used. When the two sequences are aligned, the 400 bases in the overlapping A and A' regions match 100%. The 200 bases in the overlapping B and C regions match 42%. Over the entire alignment, 484 out of 600 bases match, yielding a global match percentage of 81%.

However, SeqMan NGen checks the match percentage for every alignment of 50 bases. The alignment below shows the last 36 overlapping bases of A and A' and the first 18 overlapping bases of B and C. Each mismatch in the overlap is marked by an X below the alignment.

In the first 50 bases shown, there are 41 matches, and the match percentage is 82%. This is above the threshold of 80%, so the match percentage of the next 50 bases is checked and is also found to be 82%. Each window of fifty bases is checked along the overlap as long as the match percentage is at or above the threshold. In this case, the alignment fails once it gets far enough into the overlap of the unique regions, B and C, that the match percentage drops to 78%. The sequences will not be assembled together into a contig, which is correct for this data set.

Mer Tags

The SeqMan NGen layout algorithm relies on unique subsequences of bases, or mers, which occur in overlapping regions of fragment reads. Mers that are common to two or more fragment reads are aligned to determine the overall layout of reads. Overlapping reads have many mers in common, but only a few mers per overlapping region are needed to identify the overlap. These mers are called mer tags. The use of mers to tag fragments and identify overlaps is illustrated in the following figure:

Note: As shown in the above figure, a 54 bp original DNA sequence is covered by five overlapping fragment reads. The 6-mer tags for each fragment read are underlined. Matching mer tags are aligned to determine the layout of the reads.

The power of using mer tags relies on the ability of SeqMan NGen to choose mers that are most likely to occur only once in the original DNA sequence. It is important to avoid choosing mers that occur in repeated regions, since the result may be fragment reads that are incorrectly aligned together.

Three parameters are involved in choosing mer tags: Match Size, Repeat Handling, and Match Spacing. All of these parameters can be adjusted in the Advanced assembly options dialog.

The Match Size and Repeat Handling parameters help to choose tags that are most likely to be unique in the original DNA sequence. Match Size sets the length of the mers. The longer the mer, the higher the probability that it is unique. Repeat Handling parameters help to identify which mers are not likely to be unique. If a mer occurs more often than expected in the dataset, the mer may be part of a repeated region.

Match Spacing specifies the preferred distance between mer tags. The smaller the Match Spacing parameter value, the more memory and time the assembly will take. If a fragment read is shorter than the Match Spacing value, multiple mer tags are still chosen for the read.

Note: During assembly, any given read will only be assigned to one contig, even if it matches the hit criteria for more than one contig. If there is no information linking the read to a specific contig (e.g., a unique SNP or a paired-end constraint), SeqMan NGen will assign the sequence randomly to one of the contigs for which it meets the criteria.
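The way mer tags identify an overlap can be sketched in a few lines of Python. This is an illustrative toy, not SeqMan NGen's actual layout algorithm: the function names and the parameters match_size, match_spacing and max_occurrences are hypothetical stand-ins for the Match Size, Match Spacing and Repeat Handling settings described above, and the example uses 4-mers on a made-up 24-base sequence rather than the 6-mers of the figure.

```python
from collections import Counter


def mers(read, match_size):
    """Yield (offset, mer) for every mer of length match_size in a read."""
    for offset in range(len(read) - match_size + 1):
        yield offset, read[offset:offset + match_size]


def count_mers(reads, match_size):
    """Count how often each mer occurs across all reads in the data set."""
    return Counter(mer for read in reads for _, mer in mers(read, match_size))


def shared_mer_tags(read_a, read_b, counts, match_size, match_spacing,
                    max_occurrences):
    """Return (mer, offset_in_a, offset_in_b) tags common to both reads.

    Assumption for this sketch: a mer seen more than max_occurrences times
    in the data set is treated as a possible repeat and never used as a tag,
    and accepted tags are kept at least match_spacing bases apart in read_a.
    """
    first_in_b = {}
    for offset, mer in mers(read_b, match_size):
        first_in_b.setdefault(mer, offset)

    tags = []
    next_allowed = 0
    for offset, mer in mers(read_a, match_size):
        if offset < next_allowed:
            continue  # too close to the previous tag
        if counts[mer] > max_occurrences:
            continue  # over-represented mer, possibly from a repeat
        if mer in first_in_b:
            tags.append((mer, offset, first_in_b[mer]))
            next_allowed = offset + match_spacing
    return tags


# Toy data: three reads cut from one made-up 24-base sequence.
reference = "ACGTAGCTTGACCATGGAATCGCA"
read_1, read_2, read_3 = reference[0:12], reference[6:18], reference[14:24]
counts = count_mers([read_1, read_2, read_3], match_size=4)

tags = shared_mer_tags(read_1, read_2, counts,
                       match_size=4, match_spacing=2, max_occurrences=2)
for mer, offset_1, offset_2 in tags:
    # The offset difference is where read_2 starts relative to read_1.
    print(mer, offset_1 - offset_2)
```

In this toy data set, both tags shared by read_1 and read_2 imply the same relative placement (read_2 starts 6 bases after read_1), which is the kind of agreement a layout step can use. Lowering max_occurrences to 1 rejects every shared mer here, which shows why a repeat threshold that is too strict would also discard genuine overlaps.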


Advanced Options (BAM Layout)

Clicking the Advanced Assembly Options button from the Assembly Options (BAM Layout) dialog opens the Advanced Options dialog. For information about this dialog, see the topics Alignment Options and SNP Options.

Once you are finished, click OK to save changes and return to the Assembly Options (BAM Layout) dialog, or Cancel to return without saving changes.

SNP Options Dialog

Clicking the SNP Options button from the Recalculate SNPs dialog opens the SNP Options dialog.

For more information, refer to the SNP Options tab of the Advanced Options dialog. Both SNP Options dialogs are identical except that one is a standalone dialog, while the other is part of a tabbed dialog.

Once you are finished, click OK to save changes and return to the Recalculate SNPs dialog, or Cancel to return without saving changes.

The "Your assembly is ready to begin" Dialog

This dialog is the final pre-assembly wizard dialog for all workflows:


The main part of this dialog shows the current script: a snapshot of the assembly setup and parameters.

Note: The text in this dialog is not editable. Changes can be made by returning to a previous page of the wizard and making alterations there.

If available, you may check Show All Parameters to view all parameters, rather than only the user-edited parameters. This can be useful if you want to keep a record of all of the parameter values used for an assembly. This checkbox is only present for certain types of workflows (e.g., De Novo, Special Templated).

Click Save to save your project and convert your wizard choices into a SeqMan NGen assembly script (*.script). The resulting assembly script is an editable text file that can be modified and rerun if desired.

Note: When you Save after having checked the Run as separate projects box in the Input Sequence Files screen, a set of three separate scripts is saved for the project. If you save one or more of these scripts to a location other than the main project folder, any attempt to run the assemblies from the SeqMan NGen project script will fail. Moving the scripts back to the main project folder will allow assembly to proceed.

Click the Assemble button to activate the script. The Assembly Log will open, displaying the status of the assembly.

The Assembly Log

After pressing the Assemble button from the "Your assembly is ready to begin" dialog, the Assembly Log opens, showing the status of the assembly. Once assembly has successfully completed, the following text will be displayed: "Assembly process has finished."

The Assembly Log has two buttons:

• Export Log exports the progress information as a text file. A save dialog will open prompting you to choose a location in which to save the file.

Note for Windows users: To open a text report with the correct formatting displayed, we recommend using Wordpad, Notepad++, or Microsoft Excel, and not the default Windows text editor, Notepad.

• Stop Assembly aborts an assembly that is still in progress. You can also halt the assembly by pressing Ctrl+C (Win) or Cmd+C (Mac). After halting assembly, you will see the message "Assembly process was not successful!"

Whether the process finishes on its own or is stopped manually, click Next to proceed to the Project Report dialog.

The Project Report Dialog

After assembly has finished in the Assembly Log, you are taken automatically to the Project Report dialog. If assembly failed, the dialog displays the message "Assembly failed. No report available." Otherwise, you will see the Assembly Report information in the body of the screen.

Between two and four buttons are displayed in the row under the report. The availability of a particular button depends on the workflow, and sometimes on the type of machine being used (e.g., Linux vs. Windows).

• Reveal Files: Opens the folder where the assembly output and associated files are stored.
• Launch in SeqMan Pro: Launches the completed assembly in SeqMan Pro. This button is not available for assemblies performed using Linux. Instead, you will need to move the completed assembly to a Windows or Macintosh computer in order to view the assembly in SeqMan Pro.
• Launch in ArrayStar: Launches ArrayStar with the selected project open. This button is not available on Macintosh systems or for Metagenomics/16S rRNA or Viral-Host Integration workflows.
• Validate SNPs: Launches ArrayStar with the selected project open and automatically runs the SNP validity tests (i.e., Statistics > Validation Control Accuracy). This button is only available for the Templated assemblies with control workflow.

Note: In the case of multiple assembly projects, some of these buttons will open a list of the projects. Choose the one you wish to open in SeqMan Pro or ArrayStar, and then click OK.

In addition to the usual Help and < Back buttons at the bottom of the dialog, there is one button that is unique to this dialog. The Quit button can be used to close SeqMan NGen, whether or not the assembly completed successfully.

The Assembly Report

A post-assembly text report is viewable in the Project Report dialog. This report summarizes your assembly statistics, including the parameters used, the number of assembled/unassembled sequences and contigs in your project, and the average quality scores.

If you do any of the following, the report will be exported as a text file:

• Save the assembly in SeqMan Pro format (*.sqd) in the Set Up Project Files dialog.
• Check Save Report in the Set Up Project Files dialog.

Note for all users: The same information contained within this report is also saved within each SeqMan Project file (*.sqd), regardless of whether you choose to export the report by setting this parameter. The report can be viewed in SeqMan Pro by going to Project > Report.

Note for Windows users: To open a text report with the correct formatting displayed, we recommend using Wordpad, Notepad++, or Microsoft Excel, and not the default Windows text editor, Notepad.

Some of the terms used in the Assembly Report are defined below:

Assembly Totals
  Contigs: Total number of contigs assembled.
  Contigs > 2K: Total number of assembled contigs that are more than 2000 base pairs in length.
  Contigs to Reach Genome Length x: Number of contigs needed to cover the genome length specified in the Workflow pane.
  Assembled Sequences: Number of sequences utilized in the assembly.
  Unassembled Sequences: Number of sequences excluded from the assembly.
  All Sequences: Total number of sequences in the project.
  Contig N50: Contig size at which 50% of the sequence data are represented.*
  Average Coverage: Average depth of coverage in the assembly.

Average Totals
  Sequences Per Contig: Average number of sequences used for each contig.

Average Lengths
  Contigs: Average contig length.
  Assembled Sequences: Average length of sequences used in the assembly.
  Unassembled Sequences: Average length of sequences excluded from the assembly.
  All Sequences: Average length of all sequences in the project.

Average Quality
  Assembled Sequences: Average quality score of sequences used in the assembly.
  Unassembled Sequences: Average quality score of sequences excluded from the assembly.
  All Sequences: Average quality score of all sequences in the project.

Assembly Parameters: The values specified in the Workflow, Reads, Controls and Actions tabs prior to assembly.

*In a typical microbial genome assembly, Contig N50 values exceed 80K base pairs and genome coverage is attained in fewer than 100 contigs. In many assemblies, Contig N50 exceeds 100K with genome coverage attained in 25 contigs. If paired-end Roche 454 Life Sciences data are used, contigs can be ordered into a handful of large scaffolds to attain genome coverage that greatly facilitates gap closure and completion of the genome assembly.

Output Files for Different Workflows

The output file structure varies depending upon your workflow and on the assembler used for that workflow. SeqMan NGen uses two powerful assemblers: XNG and SNG (called SMNG in Linux).

The XNG assembler (patent pending) is used for all templated assemblies, including reference-guided assembly. This assembler features an algorithm for fast, accurate assembly of extremely large genomes, and creates BAM-based outputs (e.g., *.assembly files).

The SNG/SMNG assembler is used in both reference-guided assembly and de novo assembly. The SNG/SMNG assembler generates finished assemblies in any of four formats: SeqMan Pro (SQD), ACE, SAM or BAM.

Note on CPU usage: The XNG assembler uses multiple cores, but the exact number varies over the course of the assembly. The SNG/SMNG assembler uses one core during assembly.

See the topics below for a list of the output files for a given workflow.

XNG Workflow Output

This topic describes the outputs of XNG workflows. These include:

• Templated workflows.
• BAM alignment workflow (Welcome screen = Import BAM file; BAM Import = Align BAM layout file).
• SNP recalculation workflow (Welcome screen = Import BAM file; BAM Import = Recalculate SNPs).
• Reference-guided assembly with gap closure (Choose Project Type = Genome assembly; Choose Assembly Type dialog = Reference-guided assembly with gap closure). This workflow uses both the XNG and SNG assemblers, but the output files are most similar to XNG outputs.

Each workflow varies in the number and contents of output files and folders. Only a subset of items in the table below may appear for a particular workflow.

Note: In the table below, it should be understood that the project name precedes any hyphen (-) or period (.) used at the beginning of file and folder names.

Single assemblies

The project folder has the name specified in Set Up Project Files and contains:

• .script file (if saved in the "Your assembly is ready to begin" dialog).
• .assembly folder
• -nosplit.assembly folder (Reference-guided assembly with gap closure workflows only)
• -Reports folder
• -zinternal folder
• info folder

All gene panel projects; multiple sample assemblies run as separate assemblies

The project folder has the name specified in Set Up Project Files, followed by the suffix _assemblies. This folder contains:

• .script file (if saved in the "Your assembly is ready to begin" dialog).
• Results.txt file: Overview information and statistics for each assembly.
• .table.txt file
• .template.script file
• _arstar.script file: A script to load all assemblies as a SNP project in ArrayStar.
• _arstarvalidation.script file (only if validation control was present): A script to load the validation control assembly and associated VCF file as a SNP project in ArrayStar and to automatically calculate the accuracy statistics.
• .assembly folders (one per sample)
• -Reports folders (one per sample)
• -zinternal folder
• info folder

Contents of the .assembly Folder

Note: The contents of the -nosplit.assembly folder are similar to those of the .assembly folder.

The .assembly folder is part of the output for XNG workflows. In the table below, it should be understood that the project name precedes any hyphen (-) or period (.) used at the beginning of file and folder names.

File Extension: Description

It is intended that the entire .assembly folder be opened in SeqMan Pro for viewing and analysis of the assembly. However, the following individual files also contain useful information.

.vcf: The VCF file for the assembly, if one was specified.
.bed, .txt, etc.: The target region file (.bed or manifest) for the assembly, if one was specified.
.templateinfo: Contains general information for each contig in the assembly.
.enrichment_summary.txt: Contains the textual information for the Project > Show coverage of target regions option in SeqMan Pro.
*.sqd: This file is only created when the *.assembly is first opened in SeqMan Pro. It contains saved display-specific information such as SNP filtering criteria. Double-clicking on this file will open the *.assembly package in SeqMan Pro.

There is normally no reason to open the following files.

.auxpair: (internal use only)
.bam: The BAM formatted alignment file.
.bam.bai: The BAM index file.
.capture.usersnp.vcf: (internal use only)
.combined.snpext: (internal use only)
.coverage: Contains information at each position along the contig where the coverage changes.
.coverage2: Contains information for the maximum coverage of 100 base pair intervals across the contig.
.coverage4: Contains information for the maximum coverage of 10,000 base pair intervals across the contig.
.coverage.missingsnp: Contains information about positions in dbSNP that had coverage and were called the reference base in the assembly.
.exomecapture-features: (internal use only)
.info: Contains information used by SeqMan Pro in displaying the assembly.
.midinfo: (internal use only)
missing.fas: A FASTA file of reads with no mers matching the reference.
missing.fas.qual: A base quality file of reads with no mers matching the reference.
.nocoverage.missingsnp: Contains information about positions in dbSNP that had no coverage in the assembly.
outoforder.txt: A text file of sequence reads not included in the final assembly due to excessive trimming during the alignment phase.
.pair: (internal use only)
.pairdist: Contains information about the position and distance between paired end reads.
pairspecifiers.txt: (internal use only)
poor.fas: A FASTA file of reads rejected at the layout phase due to match scores below the threshold.
poor.fas.qual: A base quality file of reads rejected at the layout phase due to match scores below the threshold.
.quant: Reprises information in the .coverage4, .coverage2 and/or .coverage files.
.region_capture.bed: (internal use only)
report.txt: Contains the textual information for the Project > Report option in SeqMan Pro.
.snp: Contains all the information for SNPs called using the Simple method.
.snpext: Contains all the information for SNPs called using either the Diploid or Haploid method.
SNPs.log: An optional text form of the .snpext table that contains information on how each was calculated. If you encounter a problem, this file is useful for DNASTAR Support to help you with troubleshooting.
.splitExt: (internal use only)
.template-comment: Contains the comment information for that contig.
.template-features: Contains the feature information for that contig.
.template-features2: (internal use only)
.template.fof: A file-of-files containing the path and file names of the reference sequences.
.template-gapped-seq: A .seq file of the template containing gaps.
.template-gaps: A binary file of the template gap information.
.template-seq: A .seq file of the template without gaps.
unaligned.fas.qual: A base quality file of reads rejected at the alignment phase.

Note for Windows users: To open text reports with the correct formatting displayed, we recommend using Wordpad, Notepad++, or Microsoft Excel, and not the default Windows text editor, Notepad.

Contents of the -Reports Folder

The -Reports folder is part of the output for XNG workflows. In the table below, it should be understood that the project name precedes any hyphen (-) or period (.) used at the beginning of file and folder names.

File Suffix or Extension: Description

-zinternal: See Contents of the -zinternal Folder for details.
-enrichment_summary.txt: (internal use only)
-pertemplateresults.txt: Overview information and assembly statistics per contig.
-projectreport.txt: Overview information of the assembly. The same report can be viewed within SeqMan Pro using the Project > Report menu command.
-unassembled.fastq: The unassembled reads from the assembly in FASTQ format. If production of this file is not specified in the script, three files are created instead:
  missing.fastq: unassembled reads with no hits to any template.
  poor.fastq: unassembled reads with scores too low to include in the layout.
  unaligned.fastq: unassembled reads included in the layout, but rejected by the aligner.

Contents of the -zinternal Folder

The -zinternal folder is part of the output for XNG workflows. In the table below, it should be understood that the project name precedes any hyphen (-) or period (.) used at the beginning of file and folder names.

File Suffix or Extension: Description

info: See Contents of the info Folder for details.
bamtosqd.script: For converting the assembly to .sqd format.
pairscheme.info: (internal use only)
results.txt: Source of _pertemplateresults.txt.

The following files instruct SeqMan NGen to convert unassembled reads into a separate SQD project:

batchunassembled: A UNIX executable file for de novo assembly of unassembled reads, used together with the two files listed below.
batchunassembled.table.txt: A table of values for running an SNG assembly (called SMNG in Linux) of the missing.fas reads against the template sequence.
batchunassembled.template.script: The SNG (called SMNG on Linux) script containing variables that are specified by the batchmissing.table.txt.

Contents of the info Folder

The info folder is part of the output for XNG workflows. In the table below, it should be understood that the project name precedes any hyphen (-) or period (.) used at the beginning of file and folder names.

File Extension: Description

-[templateid].insertion2: Contains structural variation information.
-[templateid].sv_edges.txt: Contains structural variation information.

SNG Workflow Output

The de novo workflows use SeqMan NGen's SNG assembler. For SNG workflows, the results folder contains the following files:

File Suffix or Extension: Description

.sqd: The main assembly output. To view and analyze the assembly, open this file with SeqMan Pro.
.txt: (internal use only)
-contigs.fas: Created when contigs are saved in FASTA format.
-contigs.qual: Created when contigs are saved in FASTA format. The values in the file are the sum of the base qualities at each position in the contig, up to a maximum of 90.

Note for Windows users: To open text reports with the correct formatting displayed, we recommend using Wordpad, Notepad++, or Microsoft Excel, and not the default Windows text editor, Notepad.

How To

View Assembly Results in SeqMan Pro

After assembly is complete, click the Launch in SeqMan Pro button in the SeqMan NGen wizard to open the assembly in SeqMan Pro. Alternatively, click the Reveal Files button to locate the folder of assembly results, which can be dragged and dropped on the SeqMan Pro application to view the assembly.

Note for all users: If you've opted to create both an *.sqd and an *.assembly output, you may notice that the files do not exactly match. That's because *.sqd files, unlike *.assembly files, allow sequences both before and after the template sequence.

Note for Macintosh users: Macintosh only allows one copy of SeqMan Pro to be open at a time. If you are a Mac user who has opted to save your SeqMan NGen assembly in both *.sqd and *.assembly formats, the *.assembly file will be the first to open in SeqMan Pro. Once the *.sqd file has been created by SeqMan NGen, SeqMan Pro will prompt you to save the *.assembly file so it can open the *.sqd file instead.

Note for Linux users: SeqMan Pro is not available on Linux. Linux users must move the finished assembly to a Windows or Macintosh computer in order to view it in SeqMan Pro.

Assembly results will appear in the SeqMan Project Window. The contigs in your project will be named as follows:

• If you assembled your data using a template sequence, the resulting contig will take the name of the template sequence.
• If you used repeat handling, the contigs in your project made up of sequences flagged as possible repeats will be named Repeat-00001, Repeat-00002, Repeat-00003, etc.
• If you scanned your assembly for known repeats, then the contigs containing the known repeated sequences will take the name of the repeated sequence.
• If none of the above applies, the contigs in your project will be named Contig 00001, Contig 00002, Contig 00003, etc.

Some SeqMan Pro menu options may be grayed out when working with large assembly projects such as BAM assemblies.

To open your project in SeqMan Pro at a later time, use File > Open to open a SeqMan Project (*.sqd) or SeqMan NGen assembly (*.assembly); or use File > Import to open an ACE (*.ace) file. You may also drag and drop the assembly project on SeqMan Pro. If your project file or original sequence files have been moved, SeqMan Pro will prompt you to locate the sequence files.

Detailed descriptions of all of SeqMan Pro's features can be accessed within SeqMan Pro by going to Help > Contents. The SeqMan Pro Help topics Discovering SNPs and Working with Features may be particularly useful.

Create a SeqMan NGen Assembly to Use with ArrayStar

ArrayStar is DNASTAR's software for gene expression analysis, gene characterization and large set analysis. ArrayStar can use SeqMan NGen assemblies as input. Here are some common workflows which involve SeqMan NGen assemblies being analyzed and viewed in ArrayStar:

• Import transcriptome assemblies into ArrayStar as RNA-Seq projects to perform RPK, RPM or RPKM normalization on your data.
• Import SNP tables from your SeqMan NGen assemblies in order to compare large sets of SNPs and affected genes using ArrayStar's visualization tools.
• Import SeqMan NGen assemblies into ArrayStar to perform Validation Control Accuracy testing. See Create an Assembly for Validation Control Accuracy Testing.

Create an Assembly for Validation Control Accuracy Testing

In SeqMan NGen, create a project with the following settings:

1) Welcome screen: Create new assembly project.
2) Choose Project Type: Exome assembly or Mendelian / germline gene panel assembly.
3) Choose Assembly Type: Templated assemblies with control.
4) Input Template Files: Use the Add Genome Package button to add the associated genome (in most cases, this will be "human"). At the bottom of the screen, leave both boxes checked and click the Browse button. Navigate to and open the targeted regions (*.txt or *.bed) file.
5) Input Sequence Files: Choose the Read technology, then add the read data for the Validation Control to the unpaired or paired sections. In the Experiment column, name each sample (e.g., "control"). Paired reads should be given the same names.
6) Set Up Experiments: In the row with the Validation Control experiment (there may only be one row total), check the Is Control box. Use the Control Type drop-down menu to select Validation. Press the Set VCF File button and open the VCF (*.vcf) file corresponding to the experiment. In the ensuing dialog, select VCF File for validation control experiment and click OK.
7) Assembly Options: Settings can normally be left at their defaults.
8) Perform the assembly.



9) If you performed the assembly on a Windows machine and will be doing the ArrayStar analysis on the same machine, click the Validate SNPs button to launch ArrayStar and perform the analysis. Otherwise, see the ArrayStar help topic Validation Control Accuracy for further instructions.

Export ArrayStar Sequences to SeqMan NGen

ArrayStar is DNASTAR's software for gene expression analysis, gene characterization and large set analysis.

Within ArrayStar, launch the Export Sequences dialog via the Data > Export Sequences command. Within this dialog, be sure to check both Export Matching Reads and Launch NGen afterwards. The latter option will cause the template and matching reads to be loaded into SeqMan NGen for assembly and downstream analysis. In this case, the template sequence will be the single GenBank or FASTA file that contains the fragments of the original template sequences that match your selection.

Click the Export button to export the sequences and automatically launch SeqMan NGen.

Manually Specify an Isoform

By default, SeqMan NGen chooses the longest CDS as the isoform for SNP calling. If desired, you may override the automated choice by specifying the preferred isoform manually in the template sequence. To do so, follow these steps prior to importing the template sequence into SeqMan Wizard's Input Template Files screen:

1) Open the template (reference) sequence in a text editor.
2) Locate the feature of interest. Just below its location coordinates, type in /dnas_isoform.
3) Save the edited template sequence.
4) Input the saved version of the template sequence into SeqMan Wizard's Input Template Files screen.
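For templates with many features, the manual edit in steps 1 to 3 could be scripted. The following Python sketch is a hypothetical helper, not a DNASTAR tool. It assumes a GenBank flat file laid out in the usual way (feature key indented 5 spaces, location and qualifiers starting in column 22) and assumes that /dnas_isoform should be written on its own line with the same indentation as other qualifiers. Check the result in a text editor before importing it.

```python
QUALIFIER_INDENT = " " * 21  # assumption: qualifiers start in column 22


def mark_isoform(genbank_text, feature_key, location_start):
    """Insert /dnas_isoform just below the location of the first feature
    whose key is feature_key and whose location begins with location_start."""
    lines = genbank_text.splitlines()
    out = []
    i = 0
    done = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        parts = line.split()
        is_target = (
            not done
            and line.startswith("     ")
            and not line.startswith(QUALIFIER_INDENT)
            and len(parts) == 2
            and parts[0] == feature_key
            and parts[1].startswith(location_start)
        )
        if is_target:
            # A long location (e.g. a join) may continue on following lines;
            # copy those lines before adding the new qualifier.
            while (i < len(lines)
                   and lines[i].startswith(QUALIFIER_INDENT)
                   and not lines[i].strip().startswith("/")):
                out.append(lines[i])
                i += 1
            out.append(QUALIFIER_INDENT + "/dnas_isoform")
            done = True
    return "\n".join(out) + "\n"


# Made-up feature table with two CDS features for the same gene.
sample = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     CDS".ljust(21) + "100..400",
    QUALIFIER_INDENT + '/gene="abc"',
    "     CDS".ljust(21) + "join(150..200,",
    QUALIFIER_INDENT + "300..400)",
    QUALIFIER_INDENT + '/gene="abc"',
]) + "\n"

edited = mark_isoform(sample, "CDS", "join(150")
print(edited)
```

Only the first matching feature is changed, so the location prefix passed in should be specific enough to identify the intended isoform.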

Make a Custom VCF File

VCF files can be custom-made or automatically generated by sequencing software. For a description of various VCF version specifications, see the Sourceforge VCF Specification page.

Tables in VCF files must follow all of the rules below:

• The first two columns must be included, and in the same order as shown below. All cells in the first two columns must be filled.
• Four optional columns are allowed. If optional columns are present, the assembler will check the length of the string and compare against the length of the called variant. The base identities will not be checked.
• Further columns are allowed, but will be ignored.
• IMPORTANT: The table portion of the file must be sorted numerically, first by #CHROM, and then by POS. Make sure to sort the columns numerically (1, 2, 3, ...) and not alphabetically (1, 11, 12, ...). If you attempt to run the assembly after loading an improperly sorted VCF file, multiple red error messages will be displayed during the assembly.
  Note: When you try to open extremely large VCF files in a spreadsheet program or text editor for sorting purposes, you may receive an insufficient memory warning. If you need to sort a VCF file that is too big to open on your machine, we recommend using Sourceforge's VCFTools.
• If quotation marks are used anywhere in the VCF file, they must be straight quotes, not curly or "smart" quotes. In addition, quotation marks should not be used in lines beginning with ##contig, ##UnifiedGenotyper, or ##INFO.

If these rules are not followed, an error message will appear during assembly stating that the VCF file has an incorrect or missing header. Though the assembly will continue, the VCF SNP file that is output will be empty.

#CHROM (required): Chromosome identifier. Numbers are preferred, but "chr" or "ch" prefixes are allowed. All cells in this column must be filled.
POS (required): Position in the reference sequence. All cells in this column must be filled.
ID (optional): All cells must contain either the rs ID (for known dbSNP entries) or a period (.) (for unknown or nonexistent IDs).
REF (optional): The reference base(s). For unknown bases, a period (.).
ALT (optional): The variant base(s). For unknown bases, a period (.).
INFO (optional): User ID and source assembly information. For unknown bases, a period (.).
Column 7 and beyond (ignored): These columns may contain data, but they will be ignored by the SeqMan NGen assembler.

Note: Chromosome names are captured from genome template packages and used to assign contig IDs to entries from BED, VCF and manifest files. SeqMan NGen can read and produce output using common naming conventions (i.e., "chr" and "ch") and Arabic numerals. It understands that chr1, ch1, or 1 can all be used to represent the first template in the index, and so on.

In addition, Genome Template Packages sometimes internally define short names for particular chromosomes. For example, the C. elegans template package names its chromosomes using the standard convention for that organism: "I", "II", "III", "IV", "V", "X", "M". SeqMan NGen does not normally recognize Roman numerals, but can in this case, because the numbers are short names that have been mapped to specific chromosomes.
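The numeric sort required above (first by #CHROM, then by POS) can be sketched in Python as follows. This is a minimal illustration, not a DNASTAR utility: the function names are hypothetical, the whole file is assumed to fit in memory, and placing non-numeric chromosome names (such as X) after the numbered ones is an assumption of this sketch rather than a documented SeqMan NGen rule.

```python
def chrom_key(name):
    """Sort key that treats chr1, ch1 and 1 as the number 1."""
    lowered = name.lower()
    for prefix in ("chr", "ch"):
        if lowered.startswith(prefix):
            lowered = lowered[len(prefix):]
            break
    if lowered.isdigit():
        return (0, int(lowered), "")
    # Assumption: non-numeric names sort after all numbered chromosomes.
    return (1, 0, lowered)


def sort_vcf(lines):
    """Return header lines unchanged, followed by numerically sorted records."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines
               if line.strip() and not line.startswith("#")]

    def record_key(line):
        chrom, pos = line.split("\t")[:2]
        return chrom_key(chrom), int(pos)

    return header + sorted(records, key=record_key)


# Made-up records in alphabetical order (chr1, chr11, chr2), which is the
# ordering the guide warns against.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tINFO",
    "chr1\t1200\t.\tG\tA\t.",
    "chr1\t90\t.\tT\tC\t.",
    "chr11\t500\t.\tA\tG\t.",
    "chr2\t300\t.\tC\tT\t.",
]

for line in sort_vcf(vcf_lines):
    print(line)
```

For files too large to load this way, the VCFTools recommendation above still applies.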

Make a Custom BED File

If you chose Exome assembly, Mendelian/germline gene panel assembly, or Cancer/somatic gene panel assembly from the Choose Project Type screen, you may import a targeted regions file in the Input Template Sequence screen. BED files are used to define capture regions in the assembly, and can be generated by the sequence provider or made by hand. These files are basically tab-separated text files whose extension has been changed to *.bed. See the UCSC Genome Bioinformatics BED file page for detailed information.

The BED file can consist of multiple sections, each with a different track name. Text is allowed between the tables without restriction. Tables must follow all of the rules below:

• A header row is optional and can contain any text.
• The first three columns must be included, and in the same order as shown below. All cells in the first three columns must be filled.
• Additional table columns are allowed, but will be ignored.
• IMPORTANT: Each table in the file must be primarily sorted by the first column, and secondarily sorted by the second column. Make sure to sort the columns numerically (1, 2, 3, ...) and not alphabetically (1, 11, 12, ...).
  Note: If only chromosome 1 (and possibly 11) appears in SeqMan Pro's Coverage of Targeted Regions report (Project > Show coverage of target regions), this is indicative of incorrect sorting.

chrom (required): The name of the chromosome or scaffold. Numbers are preferred, but "chr" or "ch" prefixes are allowed.
chromStart (required): Starting position for the feature.
chromEnd (required): Ending position for the feature.
Column 4 and beyond (ignored): Data in these columns are ignored.

Note: SeqMan NGen can read and produce output using a variety of common chromosome naming conventions, including "chr1" and "ch1", as well as Arabic and Roman numerals. Chromosome names are captured from genome template packages and used to assign contig IDs to entries from BED, VCF and manifest files.
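Before importing a hand-made BED file, it can help to check the sort order described above. The Python sketch below is a hypothetical checker, not part of SeqMan NGen. It assumes that a line beginning with "track" starts a new table, that any line without two integer coordinates is a header row or free text to be skipped, and that non-numeric chromosome names sort after numbered ones.

```python
def chrom_key(name):
    """Sort key that treats chr1, ch1 and 1 as the number 1."""
    lowered = name.lower()
    for prefix in ("chr", "ch"):
        if lowered.startswith(prefix):
            lowered = lowered[len(prefix):]
            break
    if lowered.isdigit():
        return (0, int(lowered), "")
    # Assumption: non-numeric names sort after all numbered chromosomes.
    return (1, 0, lowered)


def unsorted_bed_rows(lines):
    """Return the 1-based line numbers of rows that break numeric sorting."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        if line.startswith("track"):
            previous = None  # a new table starts, so sorting restarts
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header row or free text between tables
        current = (chrom_key(fields[0]), int(fields[1]))
        if previous is not None and current < previous:
            problems.append(number)
        previous = current
    return problems


# Made-up table sorted alphabetically, so chr2 wrongly follows chr11.
bed_lines = [
    "track name=panelA",
    "chrom\tchromStart\tchromEnd",
    "chr1\t100\t200",
    "chr11\t50\t150",
    "chr2\t10\t60",
]

print(unsorted_bed_rows(bed_lines))
```

An empty list means every table in the file is sorted by chromosome and then by start position under these assumptions.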


Control Automatic Software Updates

SeqMan NGen respects the options you choose in SeqMan Pro regarding the automatic updating of software. To check or change the setting:

1) Launch SeqMan Pro.
2) Choose Project > Parameters from the menu.
3) Select Internet from the list on the left.
4) Check the box next to Show a notification when a newer version of Lasergene is available if you want Lasergene to automatically check for software updates. Otherwise, uncheck the box.
5) Click OK to save your changes.
