
DRAGEN™ Quick Start Guide
Sep 2017
Edico Genome Corp, North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Notice

Contents of this document and associated software and hardware are Copyright (c) Edico Genome Corporation. This document is proprietary to Edico Genome, and contains confidential information.

Proprietary & Confidential, Edico Genome Inc.

Table of Contents

Notice
1 Introduction
2 Hardware & Software Installation/Upgrade
3 Running the Self-Test
4 Running Your Own Test
  4.1 Generating a Reference (AKA Hash Table)
    4.1.1 Generating an HG19 reference
  4.2 Loading a Reference (AKA Hash Table)
  4.3 Process Your Input Data
    4.3.1 End-To-End Aligning and Variant Calling Examples
    4.3.2 Alignment Only Examples
    4.3.3 RNA Map/Align Only Examples
    4.3.4 Epigenome Map/Align Examples
    4.3.5 Variant Calling Only Examples
    4.3.6 Somatic Examples
    4.3.7 gvcf and Joint Calling Examples
    4.3.8 BCL Input Examples
    4.3.9 S3/HTTP Streaming Input Examples
5 Cloud-Specific Notes
  5.1 Input file location and transfer
  5.2 Hashtable storage and transfer
  Storing Hashtables in S3
  DNA vs RNA analysis
6 Troubleshooting

1 Introduction

This Quick Start Guide will help you to start processing data as quickly as possible. It assumes the server is powered on and that you are logged in. The full User's Guide can be found on the DRAGEN Portal website.

2 Hardware & Software Installation/Upgrade

If you are already running the latest version of the DRAGEN software and hardware, you can skip ahead to Section 3: Running the Self-Test.

Query the current version of software and hardware with the command:

    dragen_info -b

You can find out just the software version by running the command:

    rpm -q edico

To install a new version of software and/or hardware, first download the package from the DRAGEN Portal website onto your DRAGEN server. The preferred installation method is the self-extracting .run file:

    sudo sh DRAGEN_ run

During installation, if you are prompted to switch to a new hardware version, enter y. It is extremely important that the hardware upgrade process is not interrupted. When it is complete, you must halt and power cycle the server (a reboot command will not update the hardware version; you must issue a halt command and power the server off and on).

3 Running the Self-Test

Run the command:

    /opt/edico/self_test/self_test.sh

This will perform a thorough test of the hardware and will take about 15 minutes. When complete, it should output:

    SELF TEST RESULT : PASS

If there is any failure, please contact Edico Genome support. You can ignore any tests which mention NON MANDATY TEST SKIPPED.

4 Running Your Own Test

Below, we outline how to optionally generate a reference (5-15 minutes), load a reference (<1 minute), and process your own data.

4.1 Generating a Reference (AKA Hash Table)

If you do not have a reference, you can generate one using these instructions. You simply run a dragen --build-hash-table command (example below) and pass in the location of your reference FASTA file. You can specify a set of parameters when building your hash table (see the DRAGEN User Guide for more details), but for the quick start, you can run the example shell script or simple commands below. These examples assume your FASTA file is in /staging/human/reference/hg19/hg19.fa.

    /opt/edico/examples/build_hash_table.sh

    mkdir -p /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
    cd /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
    dragen --build-hash-table true --ht-reference /staging/human/reference/hg19/hg19.fa --output-dir /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The dragen --build-hash-table command is multithreaded and defaults to 8 threads, and takes about 15 minutes. You can use --ht-num-threads with a value up to 32 if your server supports that many threads, and the command will run in as little as 5 minutes.

Note that the hash table directory name lists key default parameter values that were used during the hash table build. We strongly recommend following this best practice when you generate your own hash tables and change the directory name accordingly.

4.1.1 Generating an HG19 reference

If you do not have a FASTA reference, you can get the hg19 FASTA files from UCSC and concatenate them into a single hg19.fa file using these instructions:

    mkdir /staging/hg19fa
    cd /staging/hg19fa
    wget hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
    tar -zxvf chromFa.tar.gz
    cat chr*.fa > hg19.fa

Then generate the DRAGEN hashtable reference using these commands. This will take about 20 minutes:

    mkdir /staging/hg19/
    /opt/edico/bin/dragen --ht-reference /staging/hg19fa/hg19.fa --output-directory /staging/hg19/ --build-hash-table true

4.2 Loading a Reference (AKA Hash Table)

Once the binary reference is loaded into memory on the DRAGEN board, it can be used for processing any number of input data sets; you will not need to reload the reference unless you restart the system, or wish to switch to a different reference/hash table.
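The directory-naming practice recommended in Section 4.1 can be sketched as a small shell helper. This is only an illustration: ht_dir_name is a hypothetical function, not part of the DRAGEN software, and it treats the three values after k_, f_ and m_ as opaque build parameters whose exact meaning is documented in the DRAGEN User Guide.

```shell
#!/bin/sh
# ht_dir_name is a hypothetical helper (an assumption for illustration only).
# It builds a directory name of the form <fasta>.k_<k>.f_<f>.m_<m>, the
# pattern used by the example paths in this guide.
ht_dir_name() {
    fasta=$1; k=$2; f=$3; m=$4
    printf '%s.k_%s.f_%s.m_%s\n' "$fasta" "$k" "$f" "$m"
}

# Reproduces the example directory name used throughout this guide.
ht_dir_name /staging/human/reference/hg19/hg19.fa 21 16 149
```

If you build a hash table with non-default parameters, passing the changed values to a helper like this keeps the directory name in step with the build.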
The reference will be loaded automatically the first time you process data with it; however, to load the reference genome manually onto the board, use this example shell script or command (where the reference directory in this example is /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149):

    /opt/edico/examples/load_reference.sh
    dragen -l

This should take less than 1 minute, and should return:

    DRAGEN finished normally

If a manual or automatic system reset occurs, then next time you try to process data, the reference you specify on the command line will be automatically reloaded. This is also true if you reboot the system.

4.3 Process Your Input Data

Once you have loaded your reference, it is time to process your input FASTQ data. Pick the example below that best matches your data sets.

These commands can take up to approximately 40 minutes to run on a 24 core server with SSD drives on a 30x coverage whole human genome when running end-to-end (FASTQ input to VCF output). The speed scales with input size, so a 60x coverage genome would take twice as long. Exome data takes a fraction of the time. Future releases will run even faster.

A successful result is indicated by:

    DRAGEN finished normally

followed by a block of metrics such as read count and performance. If there is any problem with the command-line arguments, an error will be displayed, followed by help usage. If your terminal window is short, you may need to scroll up to see the error. The DRAGEN log can be redirected to a file, to keep the record for future reference.

Notes:

- To get help on dragen command-line options, run: dragen -h
- These example commands are formatted for visual display and include line feeds, and some characters (such as the dash and double-dash) may have been changed by MS Word. To avoid copy-paste errors, each example command is contained in an individual shell script in /opt/edico/examples/.
- All commands can accept either FASTQ or gzipped FASTQ (fastq.gz). DRAGEN will automatically determine which file type it is.
- All of these sample commands include the -f option, which will force the output file to be overwritten if it already exists.
- These commands all assume that your DRAGEN reference (hash table) directory is /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149, and your FASTA reference file is /staging/human/reference/hg19/hg19.fa. Replace those with the correct references if needed.
- These examples assume that the example data package is present in /staging/examples (in particular, the fastq and fastq.gz files are expected to be in /staging/examples/reads).
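Since the log can be redirected to a file and success is signalled by the line "DRAGEN finished normally", one possible way to combine the two is a small wrapper like the sketch below. The name run_logged is hypothetical and not part of DRAGEN; the demonstration uses echo as a stand-in so that it runs on a machine without a DRAGEN board.

```shell
#!/bin/sh
# run_logged is a hypothetical wrapper (an assumption, not a DRAGEN tool).
# Usage: run_logged LOGFILE COMMAND [ARGS...]
# It saves the command's stdout and stderr to LOGFILE, then reports success
# only if the command exited with status 0 and the log contains the success
# line quoted in this guide.
run_logged() {
    log=$1
    shift
    status=0
    "$@" >"$log" 2>&1 || status=$?
    if [ "$status" -eq 0 ] && grep -q 'DRAGEN finished normally' "$log"; then
        echo "OK: see $log"
    else
        echo "FAILED: see $log" >&2
        return 1
    fi
}

# Stand-in command; on a real server this would be one of the example
# scripts in /opt/edico/examples/.
demo_log=$(mktemp)
run_logged "$demo_log" echo "DRAGEN finished normally"
```

Keeping the full log also preserves any argument error that would otherwise scroll off a short terminal window.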

4.3.1 End-To-End Aligning and Variant Calling Examples

1. Paired-End Fastq Input, VCF Output (Default)

    /opt/edico/examples/paired_fastq_in_vcf_out.sh

This command should take about 6 minutes on a 24-core server. This example illustrates the minimum parameters that must be specified to perform an end-to-end run. Note that by default, duplicate-marking is not performed. If you want to perform duplicate marking, see the following example in 2. Note that no BAM output is produced by default. If you want that along with the VCF file, see the example in 3. The user may optionally combine any of these per the desired use case.

2. Paired-End Fastq Input, Sorted and Duplicate-Marked, VCF Output

    /opt/edico/examples/paired_fastq_in_dupmark_vcf_out.sh
    --enable-duplicate-marking true

3. Paired-End Fastq Input, Sorted BAM and VCF Output

    /opt/edico/examples/paired_fastq_in_dupmark_bam_and_vcf_out.sh
    --enable-duplicate-marking true
    --enable-map-align-output true

4. Paired-End Fastq Input, Sorted SAM and VCF Output

    /opt/edico/examples/paired_fastq_in_dupmark_sam_and_vcf_out.sh
    --enable-duplicate-marking true
    --enable-map-align-output true
    --output-format SAM

5. Paired-End Fastq Input, Sorted CRAM and VCF Output

    /opt/edico/examples/paired_fastq_in_dupmark_cram_and_vcf_out.sh
    --enable-duplicate-marking true
    --enable-map-align-output true
    --output-format CRAM
    --cram-reference /staging/human/reference/hg19/hg19.fa

4.3.2 Alignment Only Examples

All of the variations for performing alignment shown in these examples can be used in the end-to-end case as well.

1. Map/Align Single-Ended FASTQ Input, Sorted BAM Output (Default)

    /opt/edico/examples/single_fastq_in_bam_out.sh
    dragen -f
    -1 /staging/examples/reads/sra056922_30x_rand1_100k.fastq
    --output-file-prefix SRA056922_30x_rand1_100K

2. Map/Align Single-Ended FASTQ Input, Sorted, Duplicate-Marked BAM Output

    /opt/edico/examples/single_fastq_in_dupmark_bam_out.sh
    dragen -f
    -1 /staging/examples/reads/sra056922_30x_rand1_100k.fastq
    --output-file-prefix SRA056922_30x_rand1_100K_dup_marked
    --enable-duplicate-marking true

3. Map/Align Paired-End FASTQ Input, Sorted BAM Output (Default)

    /opt/edico/examples/paired_fastq_in_bam_out.sh
    dragen -f

4. Map/Align Paired-End FASTQ Input, Sorted CRAM Output

    /opt/edico/examples/paired_fastq_in_cram_out.sh
    dragen -f
    --cram-reference /staging/human/reference/hg19/hg19.fa
    --output-format CRAM

5. Map/Align Paired-End FASTQ Input, Sorted Uncompressed BAM Output

    /opt/edico/examples/paired_fastq_in_uncompressed_bam_out.sh
    --output-file-prefix uncompressed_sra056922_30x_e10_50m
    --enable-bam-compression false

6. Map/Align Paired-End FASTQ Input, Sorted SAM Output

    /opt/edico/examples/paired_fastq_in_sam_out.sh
    --output-format SAM

7. Map/Align Paired-End FASTQ Input, Unsorted BAM Output

    /opt/edico/examples/paired_fastq_in_unsorted_bam_out.sh
    --output-file-prefix unsorted_sra056922_30x_e10_50m
    --enable-sort false

8. Map/Align Interleaved Paired-Ended FASTQ Input, BAM Output

    /opt/edico/examples/interleaved_fastq_in_bam_out.sh
    dragen -f
    -1 /staging/examples/reads/sra056922_pe_30x_rand1_10k_interleaved.fastq
    --interleaved
    --output-file-prefix SRA056922_PE_30x_rand1_10K_interleaved

4.3.3 RNA Map/Align Only Examples

Any of the Map/Align Only examples can be used for RNA. The only difference in running it is to add the option --enable-rna true to the command line. DRAGEN will automatically pick up the RNA-specific hash tables and use the RNA spliced aligner in its processing.

1. RNA Map/Align Paired-Ended FASTQ Input, BAM Output

    dragen -f
    --enable-rna true

4.3.4 Epigenome Map/Align Examples

Prior to performing an epigenome (methylation) Map/Align run with bisulfite sequencing data, you must first create methylation-specific reference hash tables:

    mkdir -p /staging/human/reference/hg19_epigenome
    dragen --build-hash-table true --ht-reference /staging/human/reference/hg19/hg19.fa --ht-max-seed-freq 64 --ht-seed-len 27 --ht-methylated true --output-directory /staging/human/reference/hg19_epigenome

The above DRAGEN command will produce two hash table directories under /staging/human/reference/hg19_epigenome: GA_converted and CT_converted. The CT_converted hash table is produced by converting each C base to T in the reference sequences. Similarly, the GA_converted hash table is produced from the G->A base-converted reference sequences. The base-converted references have less complexity, and to compensate we typically increase the hash table seed length argument (--ht-seed-len) to 27 for mammalian genomes (default seed length is 21).

1. Epigenome Map/Align, Directional-protocol, Single-Ended FASTQ Input, BAM Output

The directional (Lister) protocol produces reads from two of the four possible bisulfite sequencing strands (see Section 6 of User Guide). Consequently, when the --methylation-protocol=directional argument is used, DRAGEN will align each read or read pair twice with different constraints corresponding to the two possible strands. The following DRAGEN command will produce two separate BAM files:

    mkdir -p /staging/epigenome/directional
    dragen --output-directory /staging/epigenome/directional --methylation-protocol=directional -r /staging/human/reference/hg19_epigenome --fastq-file1=/staging/epigenome/reads/sample_1_r1.fastq.gz --RGID=rg1 --RGSM=samp1 --RGPL=illumina --output-file-prefix=sample_1

2. Epigenome Map/Align, Non-directional-protocol, Paired-Ended FASTQ Input, BAM Output

As described in Section 6 of the User Guide, the non-directional protocol produces reads from all four possible bisulfite sequencing strands. Consequently, when the --methylation-protocol=non-directional argument is used, DRAGEN will align each read four times and produce four separate BAM files.

    mkdir -p /staging/epigenome/non-directional
    dragen --output-directory /staging/epigenome/non-directional --methylation-protocol=non-directional -r /staging/human/reference/hg19_epigenome --fastq-file1=/staging/epigenome/reads/sample_10_r1.fastq.gz --fastq-file2=/staging/epigenome/reads/sample_10_r2.fastq.gz --RGID=rg10 --RGSM=samp10 --RGPL=illumina --output-file-prefix=sample_10
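The two protocols above differ in how many BAM files a run produces (two for directional, four for non-directional), so a quick post-run check is possible. The sketch below is hypothetical: expected_bam_count and count_bams are illustrative helper names, and it assumes the BAM files are written directly into the --output-directory with a .bam extension.

```shell
#!/bin/sh
# Both functions are hypothetical helpers for illustration, not DRAGEN tools.

# Number of BAM files this guide says each methylation protocol produces.
expected_bam_count() {
    case $1 in
        directional) echo 2 ;;
        non-directional) echo 4 ;;
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}

# Count *.bam files directly inside a directory (assumed output layout).
count_bams() {
    n=0
    for f in "$1"/*.bam; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

expected_bam_count directional   # prints 2
```

After a run, comparing count_bams on the output directory with expected_bam_count for the protocol gives a cheap indication that every alignment pass wrote its file.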

4.3.5 Variant Calling Only Examples

The examples shown in this section demonstrate how you can pass an existing aligned BAM or CRAM file directly to the DRAGEN Variant Caller. By default, the BAM/CRAM file will pass through the sorting stage prior to variant calling. If it is already sorted, then you can save some time by disabling the sort step.

NOTE: If you need to duplicate mark your BAM file before running the DRAGEN Variant Caller, you will need to use a separate tool for that step. The DRAGEN Duplicate Marker depends on information provided by the Mapper/Aligner which does not exist in BAM files. To take advantage of the DRAGEN Duplicate Marker, use DRAGEN in end-to-end mode.

Note: The BAM/CRAM files used as input to these example commands are not included in the example data set. They are generated by the example commands in the Alignment Only Examples above.

1. Unsorted BAM Input, VCF Output (Default)

    /opt/edico/examples/unsorted_bam_in_vcf_out.sh
    -b /staging/human/unsorted_sra056922_30x_e10_50m.bam
    --output-file-prefix unsorted_output_sra056922_30x_e10_50m

2. Sorted BAM Input, VCF Output

    /opt/edico/examples/sorted_bam_in_vcf_out.sh
    -b /staging/human/sra056922_30x_e10_50m.bam
    --output-file-prefix sorted_output_sra056922_30x_e10_50m
    --enable-sort false

3. Sorted CRAM Input, VCF Output

    /opt/edico/examples/sorted_cram_in_vcf_out.sh
    --output-file-prefix sorted_output_sra056922_30x_e10_50m
    --enable-sort false
    --cram-reference /staging/human/reference/hg19/hg19.fa
    --cram-input /staging/human/sra056922_30x_e10_50m.cram
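Whether the sort step can be skipped depends on the input already being sorted. One way to make that decision, sketched here under assumptions, is to read the SAM header and look for SO:coordinate on the @HD line, as defined by the SAM specification. The helper names are hypothetical, and the header tag only records what the producing tool declared; it does not verify the record order.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of DRAGEN.

# Reads a SAM header on stdin and succeeds if the @HD line declares
# coordinate sort order (SO:coordinate).
is_coordinate_sorted() {
    grep -Eq '^@HD.*[[:space:]]SO:coordinate([[:space:]]|$)'
}

# Prints the option this guide uses to skip sorting when the header on
# stdin declares a coordinate-sorted file; prints nothing otherwise.
sort_flag() {
    if is_coordinate_sorted; then
        echo "--enable-sort false"
    fi
}

# Demonstration with an inline header.
printf '@HD\tVN:1.5\tSO:coordinate\n' | sort_flag
```

With samtools installed (an assumption; it is not mentioned in this guide), the header of a real file could be supplied with samtools view -H piped into sort_flag.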

4.3.6 Somatic Examples

1. Paired-End Fastq Input

    --tumor-fastq1 /staging/examples/reads/sra056922_30x_shuffle16k_e10_50m_1.fastq.gz
    --tumor-fastq2 /staging/examples/reads/sra056922_30x_shuffle16k_e10_50m_2.fastq.gz

2. Sorted BAM Input

    --tumor-bam-input /staging/human/sra056922_30x_e10_50m.bam
    --output-file-prefix sorted_output_sra056922_30x_e10_50m

4.3.7 gvcf and Joint Calling Examples

1. Paired-End Fastq Input, gvcf Output

    --vc-emit-ref-confidence GVCF

2. Joint Calling with gvcf Input

    --enable-joint-genotyping true
    --output-file-prefix Joint_SRA056922_30x_e10_50M
    --variant /staging/examples/sra056922_30x_e10_50m.gvcf

4.3.8 BCL Input Examples

In this section we demonstrate how to use DRAGEN to process Illumina's BCL format files. DRAGEN can use BCL input to produce FASTQ files very quickly. With some limitations, it can also use BCL input directly to perform Map-Align and optionally Variant Calling, saving the time and space required to perform conversion to FASTQ.

Note: The BCL directory in these examples is not included in the example data package. Please replace /mnt/san/131022_hsxten008_0123_fc543 with your own BCL directory.

1. BCL to FASTQ conversion with minimal settings

This example shows how to convert data from the BCL format to FASTQ files. Note that DRAGEN will produce multiple files per sample with names like <SampleName>_001.fastq, <SampleName>_002.fastq, etc. There is no need to concatenate these files before performing Map-Align using DRAGEN: specifying the first file in the series will cause DRAGEN to read all of them as if they were concatenated into one file.

    dragen --bcl-conversion-only=true --bcl-input-dir /mnt/san/131022_hsxten008_0123_fc543 --bcl-output-dir /staging/examples/

2. Map/Align BCL Lane 1 Input, Sorted BAM Output (Default)

This example performs a Map-Align operation directly from BCL, outputting a sorted BAM file. Note that a single lane must be specified, and that lane must have a single entry in the SampleSheet.csv file (nonindexed BCL).

    dragen --bcl-input-dir /mnt/san/131022_hsxten008_0123_fc543 --bcl-only-lane 1 --output-file-prefix SRA056922_30x_rand1_100K

3. BCL Lane 3 Input, VCF Output (Default)

This full-pipeline run is subject to the same BCL streaming limitations as the example above: a single, nonindexed BCL lane.

    dragen --bcl-input-dir /mnt/san/131022_hsxten008_0123_fc543 --bcl-only-lane 3 --output-file-prefix SRA056922_30x_rand1_100K

4.3.9 S3/HTTP Streaming Input Examples

DRAGEN is capable of processing input files directly from an S3 bucket, or using HTTP pre-signed URLs. In the context of the DRAGEN pipeline, this is known as input streaming. The input files need not be downloaded to a local disk before being processed. Instead, the files are streamed over the network directly into the DRAGEN processor.

Streaming is supported for compressed FASTQ (*.fastq.gz) files. A future version of DRAGEN will also support streaming from BAM (*.bam) files.
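The streaming constraints just described (remote S3 or HTTP location, compressed FASTQ only) can be checked before launching a run. The following is a minimal sketch: can_stream is a hypothetical helper, the https:// scheme is assumed to be acceptable alongside http://, and any query string on a pre-signed URL is stripped before the extension is tested.

```shell
#!/bin/sh
# can_stream is a hypothetical pre-flight check, not a DRAGEN command.
# It succeeds when the argument looks like a remote .fastq.gz input.
can_stream() {
    uri=$1
    # Only remote locations are streaming candidates.
    case $uri in
        s3://*|http://*|https://*) ;;
        *) return 1 ;;
    esac
    # Drop everything from the first '?' so that pre-signed URL
    # parameters do not hide the file extension.
    path=${uri%%\?*}
    case $path in
        *.fastq.gz) return 0 ;;
        *) return 1 ;;
    esac
}

if can_stream s3://s3-bucket-name/path/to/object_1.fastq.gz; then
    echo "streamable"
fi
```

A local path or a .bam object fails the check, which matches the file types this guide lists as supported for streaming.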

Furthermore, streaming can be utilized in all of the configurations that use these file types, i.e. single-end FASTQs, paired-end FASTQs, and FASTQ lists. The following examples showcase some of the methods that can benefit from input streaming.

1. Streaming FASTQ Input using S3

    -1 s3://s3-bucket-name/path/to/object_1.fastq.gz
    -2 s3://s3-bucket-name/path/to/object_2.fastq.gz
    --output-file-prefix streaming

2. Streaming FASTQ Input using HTTP

    --output-file-prefix streaming

In general, the user will require permissions to be able to access the remote files. If the file is accessible to the user running DRAGEN, then DRAGEN is capable of streaming the remote file.

The S3 object will require AWS authentication and credentials. This should already be set up on the instance you are running, for example, via IAM policies.

The HTTP URL will most likely have a query string attached to it, which will provide the authentication credentials or necessary tokens to grant permission. The security method may be present in other parts of the URL, for example:

5 Cloud-Specific Notes

See the appropriate AWS Marketplace Quick-Start Guide, or AMI Quick-Start Guide, for information on allocating and configuring the f1 instances. It is assumed that those instructions have already been performed at this point.

When running DRAGEN in the cloud (on AWS f1 instances, using a DRAGEN AMI), there are some additional things to keep in mind:

- Input file location and transfer
- Hashtable location and transfer

5.1 Input file location and transfer

DRAGEN can stream FASTQ.gz and BAM input files directly from S3, so the user does not need to manually copy the input files to the instance first. See the S3/HTTP Streaming Input Examples section for example usage. A future version of DRAGEN may be able to stream output files (BAM, VCF) directly to S3.

5.2 Hashtable storage and transfer

The hashtable reference is 32-64GB and is required for all DRAGEN runs. Hashtable references are not included in the AMI because they are usually customer-specific, and would make the AMI too large. See instructions in Chapter 4.1 for generating a hashtable reference. The end-user is responsible for storing the hashtable references, and copying them to the instance. Edico's recommendations are given below. A future version of DRAGEN may be able to stream the hashtable references directly from S3.

Storing Hashtables in S3

Edico has determined that good performance (with the least maintenance) is achieved by storing the hashtables as .tar files in S3 (for example, hg19.tar, GRCh37.tar, etc.); then copying the single tar file to the f1 instance, and un-tarring it before DRAGEN runs.

If the hashtable is stored as a .tar.gz in S3, it is slightly smaller, which results in a slightly shorter download time, but it takes much more time to gunzip the file (5-10 minutes). This is not recommended.

If the hashtable is stored as individual files in a directory structure in S3, then the files may be downloaded in parallel, resulting in a slight performance improvement; also the un-tar step can be skipped, saving 2-5 minutes. However, there may be some long-term maintenance required because the filenames contained within a hashtable could change with newer versions of DRAGEN.

Users may also experiment with storing hashtables on EFS volumes which are shared across f1 instances; however, in our testing, EFS volumes with <1TB of data are not performant, and are much more expensive than S3.

DNA vs RNA analysis

If you are performing only DNA analysis, but your hashtable contains RNA information, you can decrease the size of it by 50% by simply deleting the entire anchored_rna/ subdirectory. Some newer versions of DRAGEN allow the hashtable to be generated without RNA information by default.
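The recommended storage approach (a single uncompressed .tar per hashtable, optionally with the RNA subdirectory removed for DNA-only work) might be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the demonstration uses a throwaway directory with a placeholder file instead of a real hashtable, and the S3 transfer appears only as comments with a made-up bucket name.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of DRAGEN.

# Remove the RNA portion of a hashtable directory (DNA-only analysis).
drop_rna() {
    rm -rf -- "${1:?missing hashtable directory}/anchored_rna"
}

# pack_hashtable DIR TARFILE: store DIR as an uncompressed tar archive.
pack_hashtable() {
    tar -cf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# unpack_hashtable TARFILE DESTDIR: restore the archive under DESTDIR.
unpack_hashtable() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}

# Demonstration on a throwaway directory; table.bin is a placeholder
# file name, not an actual DRAGEN hashtable file.
work=$(mktemp -d)
mkdir -p "$work/hg19/anchored_rna"
echo demo > "$work/hg19/table.bin"
drop_rna "$work/hg19"
pack_hashtable "$work/hg19" "$work/hg19.tar"
# Upload once (bucket name is an assumption):
#   aws s3 cp "$work/hg19.tar" s3://your-bucket/hashtables/hg19.tar
# On the f1 instance, download and restore before running DRAGEN:
#   aws s3 cp s3://your-bucket/hashtables/hg19.tar /staging/hg19.tar
unpack_hashtable "$work/hg19.tar" "$work/restored"
ls "$work/restored/hg19"
```

Using plain tar rather than tar.gz follows the guidance above that decompression time outweighs the smaller download.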

The command lines to run analysis on DRAGEN are similar to those provided as examples in Section 4.3. Please note that the examples in Section 4.3 are comprehensive and cover many options that DRAGEN supports in its on-site solution. The Cloud applications have limited functionality today in terms of different pipelines, but more applications will be added in the future.

6 Troubleshooting

The DRAGEN software will automatically reset the board if any problems are encountered. In the rare case that this doesn't occur automatically, you can issue this command:

    dragen_reset

If this does not resolve the issue, please use the DRAGEN Portal to create a support ticket and attach the results produced by the following command:

    sudo sosreport --batch

This tool will take several minutes to execute and will report the location where it has saved the diagnostic information in /tmp. For more details, please see the DRAGEN User Guide, which is available from the DRAGEN Portal.
